In this paper, we recover sparse signals from their noisy linear measurements by solving nonlinear differential inclusions, which we call Bregman ISS and Linearized Bregman ISS. We show that under proper conditions, there exists a bias-free and sign-consistent point on their solution paths, which corresponds to a signal that is the unbiased estimate of the true signal and whose entries have the same signs as those of the true signal. Therefore, their solution paths are regularization paths better than the LASSO regularization path, since the points on the latter path are biased. We also show how to efficiently compute their solution paths in both continuous and discretized settings: the full solution paths can be exactly computed piece by piece, and a discretization leads to Linearized Bregman iteration, which is faster and easy to parallelize. Theoretical guarantees such as sign-consistency and minimax optimal $\ell_2$-error bounds are established in both continuous and discrete settings for specific points on the paths. Early-stopping rules for identifying these points are given. The key treatment relies on the development of differential inequalities for differential inclusions and their discretizations.


1. Introduction. We study two continuous-time dynamics, Bregman ISS¹ and Linearized Bregman ISS, as well as the forward-Euler discretization of the latter, for recovering a sparse unknown signal $\beta^* \in \mathbb{R}^p$ from its noisy linear measurements

(1.1)  $y = X\beta^* + \epsilon.$

Here, $y \in \mathbb{R}^n$ is a measurement vector, $X = [x_1, \ldots, x_p] \in \mathbb{R}^{n \times p}$ is a measurement matrix, and $\epsilon$ is unknown random noise. We allow $n < p$ and assume that $\beta^*$ has $s \le \min\{n, p\}$ nonzero components. For convenience, let $S = \mathrm{supp}(\beta^*)$ and $T$ be its complement, i.e. $T = \{i : \beta^*_i = 0\}$.

Keywords and phrases: Linearized Bregman, Differential Inclusion, Early Stopping Regularization, Statistical Consistency

¹ISS abbreviates Inverse Scale Space, a name adopted from the imaging literature [BOXG05]. There, large-scale image features are recovered before small-scale ones.

The solution path $\{\rho_t, \beta_t\}_{t \ge 0}$ of Bregman ISS is given by the nonlinear differential inclusions:

(1.2a)  $\dot\rho_t = \frac{1}{n} X^T (y - X\beta_t),$

(1.2b)  $\rho_t \in \partial\|\beta_t\|_1,$

where $t \ge 0$ is time, $\rho_t \in \mathbb{R}^p$ is assumed to be right continuously differentiable in $t$, $\dot\rho_t$ is the right derivative of $\rho_t$, and $\beta_t$ is assumed to be right continuous. The inclusion condition (1.2b) restricts $\rho_t$ to a subgradient of the $\ell_1$-norm at $\beta_t$, $t \ge 0$. The initial conditions are, typically, $\rho_0 = 0$ and $\beta_0 = 0$. We will see that a solution to (1.2) exists and both $\rho_t$ and $X\beta_t$, $t \ge 0$, are unique. In addition, $\rho_t$ is piece-wise linear, and there exists a solution path $\beta_t$ that is piece-wise constant. The entire path can be computed at finitely many break points.

Linearized Bregman ISS has its solution path $\{\rho_t, \beta_t\}_{t \ge 0}$ governed by the nonlinear differential inclusions:

(1.3a)  $\dot\rho_t + \frac{1}{\kappa}\dot\beta_t = \frac{1}{n} X^T (y - X\beta_t),$

(1.3b)  $\rho_t \in \partial\|\beta_t\|_1,$

where $\kappa > 0$ is a constant. Compared to (1.2a), equation (1.3a) has the additional term $\frac{1}{\kappa}\dot\beta_t$. As $\kappa \to \infty$, (1.3) is reduced to (1.2), and the solution path of (1.3) may converge to that of (1.2) exponentially fast as $\kappa$ increases. We will see that (1.3) has a unique solution path $\rho_t$ and $\beta_t$, $t \ge 0$, which are both continuous.

The discretizations of (1.2) and (1.3) are known as Bregman Iteration (equation (3.7) of [YODG08]) and Linearized Bregman Iteration (equations (5.19-20) of [YODG08]). They were introduced in the literature of variational imaging and compressive sensing before (1.2) and (1.3). Through a change of variable, Bregman Iteration becomes the iteration of the Augmented Lagrangian Method [Hes69, Pow67]. On the other hand, Linearized Bregman Iteration is a simple two-line iteration:

(1.4a)  $\rho_{k+1} + \frac{1}{\kappa}\beta_{k+1} = \rho_k + \frac{1}{\kappa}\beta_k + \frac{\alpha_k}{n} X^T (y - X\beta_k),$

(1.4b)  $\rho_k \in \partial\|\beta_k\|_1,$

which is evidently a forward Euler discretization of (1.3), where $\alpha_k > 0$ is a step size. Define $z_k = \rho_k + \frac{1}{\kappa}\beta_k$. Then (1.4) can be simplified to:

(1.5a)  $z_{k+1} = z_k + \frac{\alpha_k}{n} X^T (y - X\beta_k),$

(1.5b)  $\beta_{k+1} = \kappa \cdot \mathrm{shrink}(z_{k+1}, 1),$
where the mapping shrink is defined component-wise as

$\mathrm{shrink}(z, \lambda) := \mathrm{sign}(z)\max\{|z| - \lambda, 0\}, \quad z, \lambda \in \mathbb{R},\ \lambda \ge 0.$

Note that, for $\lambda > 0$, $\mathrm{shrink}(z, \lambda)$ is the unique solution to the convex program:

$\min_{x \in \mathbb{R}} |x| + \frac{1}{2\lambda}(x - z)^2.$
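To make the two-line iteration (1.5) concrete, the following is a minimal sketch in plain Python (standard library only). The function names `shrink` and `linearized_bregman`, the row-major list representation of $X$, and the use of a constant step size `alpha` in place of $\alpha_k$ are illustrative assumptions, not part of the original algorithm statement.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major: X[i][j] is sample i, feature j


def shrink(z: float, lam: float) -> float:
    """Soft-thresholding: sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def linearized_bregman(X: Matrix, y: Vector, kappa: float,
                       alpha: float, n_iter: int) -> List[Vector]:
    """Run iteration (1.5) with a constant step size (an assumption here)
    and return the whole sequence beta_0, beta_1, ..., beta_{n_iter}."""
    n, p = len(X), len(X[0])
    z = [0.0] * p          # z_0 = rho_0 + beta_0 / kappa = 0
    beta = [0.0] * p       # beta_0 = 0
    path = [beta[:]]
    for _ in range(n_iter):
        residual = [y[i] - sum(X[i][j] * beta[j] for j in range(p))
                    for i in range(n)]
        grad = [sum(X[i][j] * residual[i] for i in range(n)) / n
                for j in range(p)]                           # (1/n) X^T (y - X beta_k)
        z = [z[j] + alpha * grad[j] for j in range(p)]       # (1.5a)
        beta = [kappa * shrink(z[j], 1.0) for j in range(p)]  # (1.5b)
        path.append(beta[:])
    return path
```

Each entry of the returned list is one point on the discrete regularization path; early stopping amounts to picking an index in this list rather than running to convergence.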

1.1. Motivations and contributions. Our exposition is motivated by the fact that the solution path $\{\beta_t\}_{t \ge 0}$ of the differential inclusion (1.2) and the sequence $\{\beta_k\}_{k \ge 0}$ of (1.4) are better than the points on the LASSO regularization path. In particular, while the LASSO regularization path is always biased, $\beta_t$ can be unbiased when the correct set of variables is reached.

To see this, consider the LASSO problem [Tib96],

(1.6)  $\min_\beta\ \lambda\|\beta\|_1 + \frac{1}{2n}\|y - X\beta\|_2^2,$

where for the convenience of comparison we replace the regularization parameter $\lambda$ by $t = 1/\lambda$ in the following equivalent form

(1.7)  $\min_\beta\ \|\beta\|_1 + \frac{t}{2n}\|y - X\beta\|_2^2.$

Besides the obvious relation $t = 1/\lambda$, note that the solution $\beta$ is piece-wise linear in $\lambda$ [EHJT04] but not in $t$. Nevertheless, $t$ is convenient for our analysis because it reflects the time evolution of the solution.

Since (1.7) is a convex program, $\beta_t$ is a solution to (1.7) if and only if it obeys the first-order optimality conditions

(1.8a)  $\frac{\rho_t}{t} = \frac{1}{n} X^T (y - X\beta_t),$

(1.8b)  $\rho_t \in \partial\|\beta_t\|_1,$

which are obtained by taking the subdifferential of the objective in (1.7).

It is well-known that the LASSO solution $\beta_t$ is biased [FL01]. For example, considering the simple case that $n = p = 1$, $X$ is the identity and $y > 0$, then (1.8) yields

(1.9)  $\beta_t = 0$ if $t < 1/y$; $\quad \beta_t = y - 1/t$ otherwise,

while (1.2) has the solution

(1.10)  $\beta_t = 0$ if $t < 1/y$; $\quad \beta_t = y$ otherwise,

which is unbiased for $t \ge 1/y$ as $\mathrm{E}[\beta_t] = \mathrm{E}[y] = \beta^*$.

Moreover, the Linearized Bregman ISS (1.3) has the solution

(1.11)  $\beta_t = 0$ if $t < 1/y$; $\quad \beta_t = y(1 - e^{-\kappa(t - 1/y)})$ otherwise,

which converges to the unbiased Bregman ISS estimator exponentially fast.
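The three scalar paths (1.9)-(1.11) are easy to compare numerically. The sketch below simply evaluates the closed forms; the function names are illustrative and the code assumes $y > 0$, as in the example above.

```python
import math


def lasso_1d(t: float, y: float) -> float:
    """LASSO path (1.9) for n = p = 1, X = 1, y > 0."""
    return 0.0 if t < 1.0 / y else y - 1.0 / t


def bregman_iss_1d(t: float, y: float) -> float:
    """Bregman ISS path (1.10): jumps to y once rho_t reaches 1."""
    return 0.0 if t < 1.0 / y else y


def linearized_bregman_iss_1d(t: float, y: float, kappa: float) -> float:
    """Linearized Bregman ISS path (1.11): relaxes to y at rate kappa."""
    if t < 1.0 / y:
        return 0.0
    return y * (1.0 - math.exp(-kappa * (t - 1.0 / y)))


if __name__ == "__main__":
    y = 2.0
    for t in (0.25, 0.5, 1.0, 4.0):
        print(t, lasso_1d(t, y), bregman_iss_1d(t, y),
              linearized_bregman_iss_1d(t, y, kappa=10.0))
```

At any finite $t > 1/y$ the LASSO value stays strictly below $y$ by the amount $1/t$, while the Bregman ISS value equals $y$ and the linearized version approaches $y$ as $\kappa(t - 1/y)$ grows.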

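Let us discuss this phenomenon in the general setting. First, let the oracle estimator be the subset least-squares solution $\tilde\beta^*$ given the true set of variables $S$ by an oracle, whose nonzero entries are given by

(1.12)  $\tilde\beta^*_S = \beta^*_S + \left(\frac{1}{n} X_S^T X_S\right)^{-1} \frac{1}{n} X_S^T \epsilon.$

Clearly $\tilde\beta^*_S \sim \mathcal{N}(\beta^*_S, \Sigma_n)$ where $\Sigma_n = \frac{\sigma^2}{n}\left(\frac{1}{n} X_S^T X_S\right)^{-1}$. Since in expectation with respect to noise, $\mathrm{E}[\tilde\beta^*] = \beta^*$, $\tilde\beta^*$ is an unbiased estimate of $\beta^*$.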

In reality we are not given the support set $S$, so the following two properties are used to evaluate the performance of an estimator $\hat\beta$.

1. Model selection consistency: $\mathrm{supp}(\hat\beta) = S$;

2. Asymptotic normality: $\sqrt{n}(\hat\beta_S - \beta^*_S) \to \mathcal{N}(0, \Sigma^*)$, where

$\Sigma^* = \sigma^2\left(\lim_{n \to \infty} \frac{1}{n} X_S^T X_S\right)^{-1}.$

Since these properties hold for the oracle estimator, they are often referred to as the oracle properties.

A solution mapping $\beta_t : [0, \infty) \to \mathbb{R}^p$ gives a regularization path. Model selection consistency, also known as path consistency, refers to the existence of a point $\beta_\tau$ on this path that selects the correct variables, namely, $\mathrm{supp}(\beta_\tau) = S$. Path consistency has been obtained for LASSO by establishing the stronger property of sign consistency, that is, $\mathrm{sign}(\beta_\tau) = \mathrm{sign}(\beta^*)$, under certain conditions such as those in [ZY06, Zou06, YL07, Wai09]. Provided that path consistency is reached at $\tau$, the LASSO estimate $\hat\beta_\tau$ is nonetheless biased since

(1.13)  $\hat\beta_{\tau,S} = \left(\frac{1}{n} X_S^T X_S\right)^{-1} \frac{1}{n} X_S^T y - \frac{1}{\tau}\left(\frac{1}{n} X_S^T X_S\right)^{-1} \rho_{\tau,S},$

where $\rho_\tau = \mathrm{sign}(\hat\beta_\tau) \in \partial\|\hat\beta_\tau\|_1$. The first term on the right-hand side equals the oracle estimator $\tilde\beta^*_S$, which is unbiased, whereas the second term never vanishes and is the bias. Hence, the oracle properties are never completely met by LASSO.
The bias can be removed by a simple differentiation of the LASSO solution. To see this, multiplying both sides of (1.8a) by $t$ and differentiating with respect to $t$ shows that any point on the LASSO path satisfies

(1.14)  $\dot\rho_t = \frac{1}{n} X^T \left(y - X(\dot\beta_t t + \beta_t)\right).$

With path consistency assumed at time $t = \tau$, we have $\beta_{\tau,i} = 0$, $\forall i \notin S$, and from (1.14) we have

(1.15)  $\dot\rho_{\tau,S} = \frac{1}{n} X_S^T \left(y - X_S(\dot\beta_{\tau,S}\tau + \beta_{\tau,S})\right).$

Generically, sign consistency occurs in a neighborhood of $\tau$ and thus $\dot\rho_{\tau,S} = 0$. Therefore,

$\dot\beta_{\tau,S}\tau + \beta_{\tau,S} = \left(\frac{1}{n} X_S^T X_S\right)^{-1} \frac{1}{n} X_S^T y = \tilde\beta^*_S,$

which is the oracle estimator without bias! This motivates us to replace $(\dot\beta_t t + \beta_t)$ in (1.14) by just $\beta_t$, which gives the differential inclusion (1.2a) of Bregman ISS. Later we will show that the resulting $\beta_t$ in (1.2) is indeed unbiased.

Therefore, in addition to giving the basic solution properties such as existence, uniqueness, and (dis)continuity, we also attempt to explain the good behaviors of the new solution paths and sequence by establishing their path consistency property. Basically we argue that

1. Under nearly the same conditions as for LASSO [ZY06, Zou06, YL07, Wai09], namely that the covariates $x_i$ are sufficiently uncorrelated and the signal $\beta^*_S$ is strong enough, Bregman ISS (1.2) with a proper early stopping rule will return the oracle estimator;

2. Sign consistency and $\ell_2$-error bounds of minimax rates can be generalized to the Linearized Bregman iteration (1.4) and its limit dynamics (1.3), under similar conditions.

Some  computational  aspects  are  reviewed  in  the  next  subsection. 

1.2.  Related  work. 

1.2.1. Regularization and other algorithms. For general penalized least squares problems, [FL01] has shown that no convex penalty function can fully achieve the oracle properties and thus one has to resort to non-convex regularization, whose global minimizer is, however, algorithmically difficult to locate. Alternatively, one can apply LASSO for variable selection and then remove the bias in LASSO by solving a subset least squares problem in the second stage. On the other hand, [OBG+05] noticed that Bregman iteration may reduce bias, also known as contrast loss, in the context of Total Variation image denoising. In this paper, we shall see that dynamics (1.2) can automatically remove bias without any non-convexity or second-stage subset least squares. It is a different kind of regularization via early stopping.

Early stopping regularization has been studied widely in linear inverse problems, e.g. [EHN96], and recently in Boosting, e.g. [Fri01, BY02, YRC07]. In fact, Linearized Bregman iterations can be viewed as an extension of Landweber iteration (also called L2-Boost in statistics),

$\beta_{k+1} = \beta_k + \frac{\alpha_k}{n} X^T (y - X\beta_k),$

which follows the primal path $\beta_t$ as a gradient descent method for solving the least squares problem. To obtain sparse solutions, Linearized Bregman iterations (1.4) add the dual path $\rho_t$, which favors sparsity.

Linearized Bregman iteration (1.5) is shown in [Yin10] to be equivalent to the gradient ascent iteration applied to the Lagrange dual of the problem

(1.16)  $\min_\beta\ \|\beta\|_1 + \frac{1}{2\kappa}\|\beta\|_2^2$ subject to $X\beta = y.$

In particular, $\beta_k$ converges to the unique solution of (1.16) at a linear rate (as long as $X \neq 0$ and $X\beta = y$ has a solution); see [LY13]. In addition, for sufficiently large $\kappa$, the solution to (1.16) is a solution to the basis pursuit model [CDS98], which is (1.16) without the term $\frac{1}{2\kappa}\|\beta\|_2^2$. In noisy settings, early stopping regularization is necessary for signal recovery. The results in this paper basically say that under nearly the same condition as LASSO, Bregman ISS with early stopping regularization may recover the signal without bias. We note that such dynamics can be easily extended to general settings with differentiable convex loss and non-differentiable convex penalty, e.g. Linearized Bregman iteration in matrix completion [CCS10].

One should not confuse Linearized Bregman iteration (1.5) with the iterative soft-thresholding algorithm (ISTA), which has been widely used under different names in the literature (for example, see [DJ95, Don95, CDLL98, DD02, DDD04]),

$\beta_{k+1} = \mathrm{shrink}\left(\beta_k + \frac{\alpha_k}{n} X^T (y - X\beta_k),\ \lambda_k\right).$

By moving the shrinkage operator to a different place in (1.5), Linearized Bregman iteration generates a sparse solution path with early stopping regularization, while ISTA exploits $\lambda_k$ as the regularization parameter and its iterates converge to a LASSO solution.
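For contrast with (1.5), here is a minimal standard-library sketch of ISTA. The names `shrink` and `ista` are illustrative; the constant step size `alpha` and the choice of threshold `alpha * lam` (playing the role of $\lambda_k$ in the display above) are assumptions made so that the fixed point solves the LASSO problem (1.6) with parameter `lam`.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major: X[i][j] is sample i, feature j


def shrink(z: float, lam: float) -> float:
    """Soft-thresholding: sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def ista(X: Matrix, y: Vector, lam: float, alpha: float, n_iter: int) -> Vector:
    """Iterative soft-thresholding: shrinkage is applied to the primal
    iterate itself, so lam (not the iteration count) sets the sparsity."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        residual = [y[i] - sum(X[i][j] * beta[j] for j in range(p))
                    for i in range(n)]
        grad = [sum(X[i][j] * residual[i] for i in range(n)) / n
                for j in range(p)]
        # Gradient step on the least-squares term, then soft-threshold.
        beta = [shrink(beta[j] + alpha * grad[j], alpha * lam)
                for j in range(p)]
    return beta
```

In the scalar case $n = p = 1$, $X = 1$, the limit is $y - \lambda$ for $y > \lambda$, which exhibits the LASSO bias discussed in Section 1.1; Linearized Bregman iteration instead thresholds the auxiliary variable $z_k$ and uses the iteration count as the regularization parameter.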
1.2.2. Parallel and distributed computing. It is very easy to implement iteration (1.5) in parallel and distributed manners and apply it to very large-scale datasets. Suppose

$X = [X_1, X_2, \ldots, X_L],$

where the $X_\ell$'s are submatrices stored in a distributed manner (on a set of networked workstations). The sizes of the $X_\ell$'s are flexible and can be chosen for good load balancing. Let each workstation $\ell$ hold data $y$ and $X_\ell$, and variables $z_{k,\ell}$ and $w_{k,\ell} := X_\ell\beta_{k,\ell}$, which are parts of $z_k$ and summands of $w_k := X\beta_k$, respectively. The iteration (1.5) is carried out as

for $\ell = 1, \ldots, L$ in parallel:

    $z_{k+1,\ell} = z_{k,\ell} + \frac{\alpha_k}{n} X_\ell^T (y - w_k),$
    $w_{k+1,\ell} = \kappa X_\ell\,\mathrm{shrink}(z_{k+1,\ell}, 1),$

all-reduce summation: $w_{k+1} = \sum_{\ell=1}^{L} w_{k+1,\ell},$

where the all-reduce step collects inputs from and then returns the sum to all the $L$ workstations. It is the sum of $L$ $n$-dimensional vectors, so no matter how the all-reduce step is implemented, the communication cost is independent of $p$. It is important to note that the algorithm is not changed at all. In particular, distributing the data into more computing units, i.e., increasing $L$, does not increase the number of iterations. Therefore, the parallel implementation is nearly embarrassingly parallel and truly scalable. In addition, it is also possible to develop implementations for data divided into blocks of rows of $X$ or even smaller subblocks that split both rows and columns. Recently, (1.5) has also been extended in [YLYR13] to a decentralized setting where not only data and computation are distributed but communication is restricted to computing units with direct communication links, so there is no data fusion center or long-distance communication. The scheme fits sensor networks or multi-party regression over the internet, where long-distance communication incurs long delays and high costs.
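The column-split scheme can be simulated on a single machine to see that only $n$-dimensional vectors cross block boundaries. The sketch below is an illustration under stated assumptions: the "workers" are a plain Python loop, the all-reduce is an in-process sum, the step size is constant, and the name `lbi_column_split` is hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major block: Xl[i][j], i = sample, j = local feature


def shrink(z: float, lam: float) -> float:
    """Soft-thresholding: sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lbi_column_split(blocks: List[Matrix], y: Vector, kappa: float,
                     alpha: float, n_iter: int) -> Vector:
    """Simulate iteration (1.5) with X = [X_1, ..., X_L] split by columns.
    Each block keeps its own z_{k,l}; only w_k = X beta_k is shared."""
    n = len(y)
    z = [[0.0] * len(Xl[0]) for Xl in blocks]   # worker-local z_{k,l}
    w = [0.0] * n                               # shared w_k (all workers hold a copy)
    for _ in range(n_iter):
        r = [y[i] - w[i] for i in range(n)]     # residual y - w_k
        partial = []
        for l, Xl in enumerate(blocks):         # conceptually runs in parallel
            p_l = len(Xl[0])
            z[l] = [z[l][j] + alpha / n * sum(Xl[i][j] * r[i] for i in range(n))
                    for j in range(p_l)]
            beta_l = [kappa * shrink(v, 1.0) for v in z[l]]
            partial.append([sum(Xl[i][j] * beta_l[j] for j in range(p_l))
                            for i in range(n)])  # w_{k+1,l} = X_l beta_{k+1,l}
        # All-reduce: sum of L vectors of length n, independent of p.
        w = [sum(part[i] for part in partial) for i in range(n)]
    return [kappa * shrink(v, 1.0) for zl in z for v in zl]
```

In a real deployment the inner loop body would run on separate machines and the final sum would be replaced by a collective all-reduce from whichever communication library is in use; that substitution does not change the iterates.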


1.3. Notation and assumptions. We introduce the following notation and assumptions on $\beta^*$, $X$, and $\epsilon$.

• Let the true support be denoted by $S = \mathrm{supp}(\beta^*) = \{i : \beta^*_i \neq 0\}$, and let $T = S^c$ be its complement. Clearly, $S \cup T = \{1, \ldots, p\}$.

• $X_S$ denotes the submatrix of $X$ formed by the columns of $X$ in $S$, which are assumed to be linearly independent. Similarly define $X_T$ so that $[X_S\ X_T] = X$.

• Assume $\epsilon \sim \mathcal{N}(0, \sigma^2 I_n)$. It can generalize to sub-Gaussian noise without violating most of our results.

Define $\langle u, v\rangle = u^T v$ and $\langle u, v\rangle_n = \frac{1}{n} u^T v$ for $u, v \in \mathbb{R}^n$. Hence $\|u\|_n = \frac{1}{\sqrt{n}}\|u\|_2$.

Let $X^* = \frac{1}{n} X^T$ be the adjoint operator of $X$ with respect to the inner product $\langle\cdot,\cdot\rangle_n$. Let the largest and the smallest nonzero magnitudes of $\beta^*$ be $\beta^*_{\max} := \max(|\beta^*_i| : i \in S)$ and $\beta^*_{\min} := \min(|\beta^*_i| : i \in S)$, respectively. Similarly define $\tilde\beta^*_{\max}$ and $\tilde\beta^*_{\min}$ for the oracle estimator $\tilde\beta^*$ in (1.12). The dependence of $\rho_t$ and $\beta_t$ (or equivalently $\rho(t)$ and $\beta(t)$) on $t$ is omitted where it is clear from the context. For the reason to be discussed in Section 2, we shall assume that $\rho_t$ is right continuously differentiable and $\beta_t$ is right continuous.

Throughout the paper, given two numbers $a$ and $b$, let $a \vee b := \max(a, b)$.

1.4. Outline. In the rest of this paper, we establish basic solution properties of Bregman and Linearized Bregman ISS in Section 2. Section 3 and Section 4 describe statistical consistency properties of Bregman ISS and their generalizations to Linearized Bregman ISS/discretization, respectively. Section 5 is dedicated to the ideas of proofs. Section 6 provides some preliminary numerical results. Discussions and conclusions are summarized in Section 7.

2. Bregman and Linearized Bregman solution paths. The solution to Bregman ISS (1.2) is a piece-wise regularization path given iteratively as follows, starting with $k = 0$, $t_0 = 0$, and $\rho_0 = \beta_0 = 0$:

1. set $t_{k+1} := \sup\{t > t_k : \rho_{t_k} + \frac{t - t_k}{n} X^T (y - X\beta_{t_k}) \in \partial\|\beta_{t_k}\|_1\}$; if $t_{k+1} = \infty$, then exit;

2. set $\rho_{t_{k+1}} := \rho_{t_k} + \frac{t_{k+1} - t_k}{n} X^T (y - X\beta_{t_k})$;

3. set $S_{k+1} := \{i : |(\rho_{t_{k+1}})_i| = 1\}$ and $T_{k+1} := \{1, \ldots, p\} \setminus S_{k+1}$;

4. set $\beta_{t_{k+1}}$ as any solution to

(2.1)  $\min_\beta \|y - X\beta\|_2^2$ subject to $(\rho_{t_{k+1}})_i \beta_i \ge 0\ \ \forall i \in S_{k+1}$, $\quad \beta_j = 0\ \ \forall j \in T_{k+1}$;

5. set $k := k + 1$ and go to Step 1.
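As a small illustration of Steps 1-3 at $k = 0$: since $\rho_0 = \beta_0 = 0$ and $\partial\|0\|_1 = [-1, 1]^p$, the first break point is $t_1 = 1/\|\frac{1}{n} X^T y\|_\infty$ and $S_1$ collects the coordinates attaining that maximum. The sketch below computes only this first knot; the function name and the tie tolerance `1e-12` are illustrative assumptions, and later knots additionally require solving the sign-constrained least squares problem (2.1), which is not shown here.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # row-major: X[i][j] is sample i, feature j


def first_breakpoint(X: Matrix, y: Vector) -> Tuple[float, List[int]]:
    """Return (t_1, S_1) for the Bregman ISS path started at rho_0 = beta_0 = 0.

    For t in [0, t_1) we have rho_t = t * c with c = (1/n) X^T y, so t_1 is
    the first time some |rho_i| reaches 1, i.e. t_1 = 1 / max_i |c_i|.
    """
    n, p = len(X), len(X[0])
    c = [sum(X[i][j] * y[i] for i in range(n)) / n for j in range(p)]
    c_max = max(abs(v) for v in c)
    if c_max == 0.0:
        return math.inf, []          # rho never moves: the algorithm exits
    t1 = 1.0 / c_max
    active = [j for j in range(p) if abs(abs(c[j]) * t1 - 1.0) <= 1e-12]
    return t1, active
```

For example, with $X$ the $2 \times 2$ identity and $y = (2, 1)$ we get $c = (1, 0.5)$, hence $t_1 = 1$ and $S_1 = \{1\}$ (index 0 in the code).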

Paper [BMBO13] gives an algorithm but does not establish the uniqueness of the solution path. One can show that the solution to (1.2) is piece-wise:

(2.2)  $\rho_t = \rho_{t_k} + \frac{t - t_k}{t_{k+1} - t_k}\left(\rho_{t_{k+1}} - \rho_{t_k}\right), \quad \beta_t = \beta_{t_k}, \quad t \in [t_k, t_{k+1}).$

In other words, $\rho_t$ is piece-wise linear and $\beta_t$ is piece-wise constant. The following theorem presents some general conditions to ensure the existence and uniqueness of the solution path.
Theorem 2.1 (Solution existence and uniqueness for Bregman ISS). Let $\rho_t$ be right continuously differentiable and $\beta_t$ be right continuous. Then, a solution to (1.2) is given by $(\beta_t, \rho_t)$ generated by the above algorithm. Solution $\rho_t$ and $X\beta_t$ are unique. In addition, if the columns $x_i$ of $X$ for $i \in \mathrm{supp}(\beta_t)$ are linearly independent for $t \ge 0$, then $\beta_t$ is also unique.


Proof. The existence part follows from [BMBO13].

We now show the uniqueness part. Define $f(\beta) := \frac{1}{2n}\|y - X\beta\|_2^2$. Then, the differential inclusion (1.2) is equivalent to

(2.3a)  $\dot\rho_t = -\nabla f(\beta_t),$

(2.3b)  $\rho_t \in \partial\|\beta_t\|_1.$

Let $S_t^+ := \{i : (\rho_t)_i = 1\}$, $S_t^- := \{i : (\rho_t)_i = -1\}$, and $S_t = S_t^+ \cup S_t^-$. By (1.2b), in the case of $S_t = \emptyset$, we have $\beta_t = 0$, so $-\nabla f(\beta_t) = -\nabla f(0)$ is unique. In the case of $S_t \neq \emptyset$, we show below that $X\beta_t$ and $-\nabla f(\beta_t)$ are both unique. The uniqueness of $\rho_t$ follows from these results and (2.3a).

In fact, (1.2a) and (1.2b) impose the following constraints on $\beta_t$:

(2.4)  $(\beta_t)_i \ge 0$ and $(\nabla f(\beta_t))_i \ge 0$, $\forall i \in S_t^+$;
       $(\beta_t)_i \le 0$ and $(\nabla f(\beta_t))_i \le 0$, $\forall i \in S_t^-$;
       $(\beta_t)_i = 0$, $\forall i \notin S_t$.

To see how $\nabla f(\beta_t)$ is involved, notice that $(\nabla f(\beta_t))_i \ge 0$ must hold for all $i \in S_t^+$ since $(\rho_t)_i \in [-1, 1]$ is already at its maximal value 1 and $(\nabla f(\beta_t))_i < 0$ is forbidden as it would further increase $(\rho_t)_i$ to an impossible value. The same argument gives $(\nabla f(\beta_t))_i \le 0$ for all $i \in S_t^-$.

Furthermore, we have $(\beta_t)_i \cdot (\nabla f(\beta_t))_i = 0$ for all $i$. To see this, assume $(\beta_t)_i \neq 0$. Then by the right continuity assumption, there exists an interval $[t, t + \epsilon)$ in which $(\beta_t)_i$ remains nonzero with the same sign. By (2.3b), $(\rho_t)_i$ will remain either $+1$ or $-1$ in the same interval, so $(\nabla f(\beta_t))_i = 0$. On the other hand, assume $(\nabla f(\beta_t))_i \neq 0$. Then by (2.3a), $(\rho_t)_i$ will change and thus it cannot stay at either $+1$ or $-1$. By the right continuity of $\beta_t$, it must hold that $(\beta_t)_i = 0$. Therefore, we have the additional constraints

(2.5)  $(\beta_t)_i \cdot (\nabla f(\beta_t))_i = 0.$
Conditions (2.4) and (2.5) are precisely the KKT optimality conditions for

(2.6)  $\min_\beta f(\beta)$ subject to $\beta_i \ge 0\ \ \forall i \in S_t^+$, $\quad \beta_i \le 0\ \ \forall i \in S_t^-$, $\quad \beta_i = 0\ \ \forall i \notin S_t$,

which is identical to (2.1) except that (2.1) specifies the time $t_{k+1}$. Let $\beta_t$ be the solution to problem (2.6).

In general, if $f$ is strictly convex, then the solution $\beta_t$ is unique. In our case, $f$ is not necessarily strictly convex, but $f(\beta) = g(X\beta)$ for a strictly convex function $g$. Therefore, $X\beta_t$ is unique, and thus so is $\nabla f(\beta_t) = X^T \nabla g(X\beta_t)$. Lastly, $\beta_t$ is unique if the columns of $X$ corresponding to nonzero entries of $\beta_t$ are linearly independent, since $X\beta_t$ is unique. □

The existence and uniqueness of the Linearized Bregman ISS solution is much simpler to establish, as shown in the following theorem.

Theorem 2.2 (Solution existence and uniqueness for Linearized Bregman ISS). Let $\rho_t$ be right continuously differentiable and $\beta_t$ be right continuous. Then (1.3) has a unique solution.

Proof. Let $z_t = f(\rho_t, \beta_t) = \rho_t + \frac{1}{\kappa}\beta_t$; then $f$ is an injective function from the admissible set of pairs $(\rho, \beta)$ to $\mathbb{R}^p$, and $\beta_t = \kappa\,\mathrm{shrink}(z_t, 1)$. Now differential inclusion (1.3) becomes the ODE

$\dot z_t = \frac{1}{n} X^T \left(y - \kappa X\,\mathrm{shrink}(z_t, 1)\right) =: g(z_t).$

Obviously, $g(z)$ is Lipschitz continuous. Therefore, the Picard-Lindelöf Theorem implies that there exists a unique solution to this ODE, which leads to the solution of (1.3). □


We note that the solution of (1.3), though not piece-wise linear or constant, can still be computed in a piece-wise closed form where on each piece, the signs of $\beta_t$ remain unchanged. This is left to the reader.

Provided that sign consistency is met by a point on the path at $t = \tau$, (2.1) returns an oracle solution as it is a least-squares problem subject to only sign constraints. Hence, natural questions are: what conditions will guarantee sign consistency? And, how to determine $\tau$? In the sequel, we provide answers to these questions. Throughout the remainder of this paper, we assume that $\rho_t$ is right continuously differentiable and $\beta_t$ is right continuous, so the existence and uniqueness of solution paths are guaranteed.
3. Consistency of Bregman ISS Dynamics. In this section necessary and sufficient conditions are established for noisy sparse signal recovery with Bregman ISS (1.2).


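3.1. Assumptions.

(A1) Restricted Strong Convexity: there is a $\gamma \in (0, 1]$ such that

$X_S^* X_S \ge \gamma I.$

(A2) Irrepresentable Condition: there is an $\eta \in (0, 1)$ such that

$\|X_T^* X_S^\dagger\|_\infty \le 1 - \eta,$

where $X_S^\dagger := X_S (X_S^* X_S)^{-1}$.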

Condition A1 says that the Hessian matrix of the empirical risk $\frac{1}{2n}\|y - X\beta\|_2^2$ restricted to the index set $S \times S$ is strictly positive definite, so the empirical risk is strongly convex when restricted to the support set $S$. Such a condition is necessary in the sense that once it fails, the columns of $X_S$ will be linearly dependent and no unique representation is possible under the basis $X_S$. Condition A2 says that the absolute row sums of the matrix $X_T^* X_S^\dagger$ are all less than one. It has been proposed independently under a variety of names, e.g. Exact Recovery Condition [Tro04] and Irrepresentable Condition [ZY06], among others [YL07, Zou06]. Here we adopt the name in [ZY06] as it refers to the fact that the regression coefficients of $X_S$ for response $X_j$ ($j \in T$) all have $\ell_1$-norm less than one, i.e.

$\beta^j = \arg\min_{\beta \in \mathbb{R}^s} \|X_j - X_S\beta\|_2^2 \ \Rightarrow\ \|\beta^j\|_1 < 1,$

so in this sense one cannot represent the irrelevant covariates $X_T$ by the relevant ones $X_S$ effectively.

Neither A1 nor A2 can be checked when the support set $S$ of the signal is not known. Alternatively we can use a stricter but checkable condition proposed in [DH01].

(A3) Mutual Incoherence Condition:

$\mu := \max_{i \neq j} \left|\frac{1}{n}\langle X_i, X_j\rangle\right| < \frac{1}{2s - 1}, \quad s = |S|.$
It can be shown [Tro04, CW11] that once A3 holds, then A1 and A2 simultaneously hold with

$\gamma = 1 - \mu(s - 1),$

since $(1 - \mu(s - 1)) I_s \le X_S^* X_S \le (1 + \mu(s - 1)) I_s$, and

$\eta = \frac{1 - \mu(2s - 1)}{1 - \mu(s - 1)}.$

We note that condition A3 is shown to be sharp in the noisy case in [CWX10]. With these, one can translate all the theoretical results under conditions A1 and A2 into results under condition A3.
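Unlike A1 and A2, condition A3 can be checked directly from $X$ and a sparsity level $s$. The following standard-library sketch computes the coherence $\mu$ and the implied constants $\gamma$ and $\eta$. The function names are illustrative, and the code assumes (as the bounds above implicitly do) that the columns are normalized so that $\|X_i\|_n = 1$; it does not perform that normalization itself.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # row-major: X[i][j] is sample i, feature j


def mutual_coherence(X: Matrix) -> float:
    """mu = max_{i != j} |<X_i, X_j>| / n over distinct columns of X."""
    n, p = len(X), len(X[0])
    return max(
        (abs(sum(X[k][i] * X[k][j] for k in range(n))) / n
         for i in range(p) for j in range(i + 1, p)),
        default=0.0,
    )


def satisfies_a3(X: Matrix, s: int) -> bool:
    """Check the Mutual Incoherence Condition mu < 1 / (2s - 1)."""
    return mutual_coherence(X) < 1.0 / (2 * s - 1)


def implied_constants(X: Matrix, s: int) -> Tuple[float, float]:
    """Return (gamma, eta) from the formulas following A3; only
    meaningful when satisfies_a3(X, s) is True."""
    mu = mutual_coherence(X)
    gamma = 1.0 - mu * (s - 1)
    eta = (1.0 - mu * (2 * s - 1)) / (1.0 - mu * (s - 1))
    return gamma, eta
```

The cost is quadratic in $p$, so for large designs one would compute the Gram matrix with an optimized linear algebra routine instead of nested Python loops.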

3.2. Mean Bregman ISS Path versus LASSO Path. As we have seen in Section 1.1 near equation (1.14), Bregman ISS (1.2) can be derived by differentiating LASSO's KKT conditions. Such a relation can be seen precisely by considering the consistency conditions of LASSO on the following temporal mean path of Bregman ISS:

(3.1)  $\bar\beta(t) := \frac{1}{t}\int_0^t \beta(s)\,ds.$

According to Theorem 2.1 and Condition A1, the Bregman ISS path $\beta_t$ is unique and thus $\bar\beta(t)$ is well defined as long as $\mathrm{supp}(\beta(s)) \subseteq S$, $s \in [0, t)$, where $S$ is the true support.

A connection between Bregman ISS and LASSO lies in the same condition under which their paths from start to time $\tau$ are supported within the true support $S$. In addition, the Bregman ISS mean path $\bar\beta(t)$ is identical to the LASSO path if the Bregman ISS path is incremental, that is, it only adds variables and never drops one. In general, the two paths are distinct.

Theorem 3.1. Let $(\beta_t, \rho_t)$ be either the Bregman ISS path (1.2) or the LASSO path (1.8) with $\rho(t) \in \partial\|\beta_t\|_1$. Assume that for all $t \le \tau$,

(3.2)  $\|X_T^* X_S^\dagger \rho_S(t) + t X_T^* P_T \epsilon\|_\infty < 1,$

where $P_T = I - P_S = I - X_S^\dagger X_S^*$ is the projection matrix onto $\mathrm{im}(X_S)^\perp$. Then for all $t \le \tau$,

A. the Bregman ISS path, its mean path, and the LASSO path all have supports in $S$;

B. the mean Bregman ISS path $\bar\beta(1/\lambda)$ is piecewise linear in $\lambda = 1/t$;

C. if the Bregman ISS path is incremental in the sense that $S_t = \mathrm{supp}(\beta_t)$ satisfies $S_t \subseteq S_{t'} \subseteq S$ for all $t \le t' \le \tau$, then the mean Bregman ISS path is identical to the LASSO path; but they are distinct in general.
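As a quick sanity check of part C, consider again the scalar example of Section 1.1 ($n = p = 1$, $X$ the identity, $y > 0$), where the Bregman ISS path only adds a variable. Averaging the closed form (1.10) over $[0, t]$ gives, as a worked computation under those assumptions:

```latex
% Scalar check of part C using the closed forms (1.9) and (1.10).
\[
  \bar\beta(t) \;=\; \frac{1}{t}\int_0^t \beta(s)\,ds
            \;=\; \frac{1}{t}\int_{1/y}^{t} y \, ds
            \;=\; y - \frac{1}{t}, \qquad t \ge 1/y,
\]
% and \bar\beta(t) = 0 for t < 1/y, since \beta(s) = 0 on [0, 1/y).
```

This is exactly the LASSO path (1.9), and in terms of $\lambda = 1/t$ it reads $y - \lambda$, which is linear in $\lambda$ as part B asserts.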

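Remark 3.1. In particular, in the noiseless setting $\epsilon = 0$, (3.2) becomes

$\|X_T^* X_S^\dagger \rho_S(t)\|_\infty < 1,$

or, dropping $\rho_S(t)$ (whose entries have magnitude at most one),

$\|X_T^* X_S^\dagger\|_\infty = \max_{j \in T}\|X_j^* X_S^\dagger\|_1 < 1,$

which is sufficient and necessary to guarantee that Bregman ISS, LASSO, and OMP [Tro04] all recover the sparse signal in the noiseless setting; once it is violated, there is some $s$-sparse signal for which these methods fail.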

Proof of Theorem 3.1. Assume there exists a $\tau > 0$ such that for all $t \le \tau$, the solution path $\beta(t)$ satisfies $\mathrm{supp}(\beta(t)) \subseteq S$. Then Bregman ISS (1.2) splits into

(3.3a)  $\dot\rho_S = -X_S^* X_S(\beta_S - \beta^*_S) + X_S^*\epsilon,$

(3.3b)  $\dot\rho_T = -X_T^* X_S(\beta_S - \beta^*_S) + X_T^*\epsilon.$

From (3.3a) one gets the Bregman ISS solution

(3.4)  $\beta_S(t) = \beta^*_S - (X_S^* X_S)^{-1}\dot\rho_S + (X_S^* X_S)^{-1} X_S^*\epsilon,$

which, plugged into (3.3b), leads to the following equation

(3.5)  $\dot\rho_T = X_T^* X_S^\dagger \dot\rho_S + X_T^* P_T \epsilon.$

Integrating both sides of this equation from $0$ to $t$ and using (3.2) gives

(3.6)  $\|\rho_T(t)\|_\infty = \|X_T^* X_S^\dagger \rho_S(t) + t X_T^* P_T \epsilon\|_\infty < 1,$

which ensures that $\beta_T(t) = 0$. The same then holds for the mean path.

On the other hand, LASSO starts from the KKT condition (1.8), which splits into

(3.7a)  $\rho_S/t = -X_S^* X_S(\beta_S - \beta^*_S) + X_S^*\epsilon,$

(3.7b)  $\rho_T/t = -X_T^* X_S(\beta_S - \beta^*_S) + X_T^*\epsilon.$

Following the same argument as above, one can see that the same condition (3.6) ensures $\beta_T(t) = 0$ for LASSO. This finishes the proof of part A.
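As to part B, for $t \le \tau$, the mean path is obtained by integration on (3.3a):

(3.8)  $\bar\beta_S(t) = \frac{1}{t}\int_0^t \beta_S(s)\,ds = \beta^*_S - \frac{1}{t}(X_S^* X_S)^{-1}\rho_S(t) + (X_S^* X_S)^{-1} X_S^*\epsilon.$

Equation (2.2) implies that $\frac{1}{t}\rho_t = \frac{1}{t}\rho_{t_k} + \frac{t - t_k}{t(t_{k+1} - t_k)}(\rho_{t_{k+1}} - \rho_{t_k})$ for $t \in [t_k, t_{k+1})$, which is piecewise linear with respect to $\lambda = 1/t$.

To see part C, let $S_t = \mathrm{supp}(\beta_t)$ for Bregman ISS. If for all $s \le t \le \tau$, $S_s \subseteq S_t \subseteq S$, then similar reasoning as above implies that the mean Bregman ISS path satisfies

(3.9)  $\bar\beta_{S_t}(t) = \beta^*_{S_t} - \frac{1}{t}(X_{S_t}^* X_{S_t})^{-1}\rho_{S_t}(t) + (X_{S_t}^* X_{S_t})^{-1} X_{S_t}^*\epsilon.$

For such incremental processes, $\rho_{S_t}(t) = \mathrm{sign}(\beta_{S_t}(t)) = \mathrm{sign}(\bar\beta_{S_t}(t))$, which meets the LASSO path equation

(3.10)  $\hat\beta_{S_t}(t) = \beta^*_{S_t} - \frac{1}{t}(X_{S_t}^* X_{S_t})^{-1}\rho_{S_t}(t) + (X_{S_t}^* X_{S_t})^{-1} X_{S_t}^*\epsilon,$

where $S_t = \mathrm{supp}(\hat\beta_t)$ for LASSO. But such a relation is lost when variable dropping happens. □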

Despite its difference from the LASSO path, the mean Bregman ISS path may reach statistical model-selection consistency under the same conditions as LASSO.


Theorem 3.2 (Sign Consistency of Mean Path). Let

$\bar\tau := \frac{\eta}{2\sigma}\sqrt{\frac{n}{\log p}}\left(\max_{j \in T}\|X_j\|_n\right)^{-1}.$

Assume that both (A1) and (A2) hold. Then the following holds.

A.  (No- false-positive)  the  mean  path  has  no-false-positive  before  time 
T,  i.e.,  yt  <T  supp(/St)  C  S,  with  probability  at  least  1  — 

B.  (No- false-negative  for  Mean  Path)  moreover  if  the  signal  is  strong 
enough  such  that  >  ci/f, 


Cl  = 


^maXjgT  \\Xj\\n 


+  II  (^5^5) 


-1| 


then  with  probability  at  least  1  —  ^^-^===,  the  mean  path  has  no 
false-negative,  i.e.  sign(,dT=)  =  sign(/3*). 
Remark 3.2. Under the same conditions as LASSO with $\lambda^* = 1/\bar\tau$ [Wai09], the mean path $\bar\beta$ of Bregman ISS reaches sign-consistency. These conditions are sufficient and necessary in the sense that once violated, there exists an instance such that the probability of failure will be larger than 1/2 due to noise. In this sense, the mean path estimator $\bar\beta(\bar\tau)$ is "statistically equivalent" to the LASSO estimator.

The mean Bregman ISS path geometrically sheds light on why LASSO incurs bias while Bregman ISS can avoid it. The LASSO path, like the mean Bregman ISS path, involves some kind of averaging that ensures the path continuity but causes bias. The Bregman ISS path is piecewise constant, which allows it to be bias-free.

Now we need to answer the following question: what are conditions to ensure the sign consistency of the Bregman ISS path?


3.3. Consistency of Bregman ISS. The following theorem tells us that under the irrepresentable (incoherence) condition, the Bregman ISS dynamics always evolves in the support of the true signal in the early stage; furthermore, if the signal is strong enough then the dynamics will pick up all the true variables before selecting any incorrect ones. When such a sign consistency is reached, Bregman ISS returns the oracle estimator, which is unbiased.


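Theorem 3.3 (Sign Consistency of Bregman ISS). Let

$\bar\tau := \frac{\eta}{2\sigma}\sqrt{\frac{n}{\log p}}\left(\max_{j\in T}\|X_j\|_n\right)^{-1}.$

Assume that both (A1) and (A2) hold. Then Bregman ISS (1.2) has paths satisfying:

A. (No-false-positive) the path has no false positive before time $\bar\tau$, i.e. $\forall t \le \bar\tau$, $\mathrm{supp}(\beta_t) \subseteq S$, with probability at least $1 - \frac{2}{p\sqrt{\pi\log p}}$;

B. (Sign-consistency) moreover if the signal is strong enough such that

(3.11)  $\beta^*_{\min} \ge \left(\frac{4\sigma}{\gamma^{1/2}} \vee \frac{8\sigma(2+\log s)\left(\max_{j\in T}\|X_j\|_n\right)}{\gamma\eta}\right)\sqrt{\frac{\log p}{n}},$

then with probability at least $1 - \frac{2}{p\sqrt{\pi\log p}}$, $\mathrm{sign}(\beta_{\bar\tau}) = \mathrm{sign}(\beta^*)$.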


Remark 3.3. Once the sign consistency holds, $\beta(\bar\tau)$ meets the oracle estimator $\tilde\beta^*$, which is unbiased and has an $\ell_2$-error rate $\|\beta(\bar\tau) - \beta^*\|_2 \le O(\sigma\sqrt{s\log s/n})$, even better than the $\ell_2$-error rate $O(\sigma\sqrt{s\log p/n})$ for the LASSO estimator, which is minimax optimal up to a logarithmic factor [RWY11].
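As a numerical companion to Theorem 3.3, the following minimal sketch evaluates the time horizon $\bar\tau$ and the right-hand side of the strong-signal condition (3.11). The function names are hypothetical, and the empirical norm $\|x\|_n = \sqrt{x^Tx/n}$ is an assumption, since its definition lies outside this section.

```python
import math


def empirical_norm(column):
    """Assumed definition: ||x||_n = sqrt((1/n) * sum_i x_i^2)."""
    n = len(column)
    return math.sqrt(sum(v * v for v in column) / n)


def tau_bar(eta, sigma, n, p, max_col_norm):
    """Time horizon: (eta / (2 sigma)) * sqrt(n / log p) / max_j ||X_j||_n."""
    return eta / (2.0 * sigma) * math.sqrt(n / math.log(p)) / max_col_norm


def beta_min_threshold(eta, gamma, sigma, n, p, s, max_col_norm):
    """Right-hand side of the strong-signal condition (3.11)."""
    rate = math.sqrt(math.log(p) / n)
    first = 4.0 * sigma / math.sqrt(gamma)
    second = 8.0 * sigma * (2.0 + math.log(s)) * max_col_norm / (gamma * eta)
    return max(first, second) * rate
```

In practice $\eta$ and $\gamma$ are unknown, so such a computation is mainly useful in simulations where the design and the true support are available.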
To have sign consistency, Theorem 3.3 makes a strong signal condition with a lower bound on $\beta^*_{\min}$. However, even without such a strong signal assumption, the minimax optimal $\ell_2$-error rates can be achieved regardless of sign consistency.

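Theorem 3.4 (Minimax Optimal $\ell_2$-Error Bound). Assume that both (A1) and (A2) hold. There is a $\tau \in [0, \bar\tau]$ such that with probability at least $1 - \frac{2}{p\sqrt{\pi\log p}}$,

$\|\beta_\tau - \beta^*\|_2 \le \frac{2\sigma}{\gamma\eta}\left(4\max_{j\in T}\|X_j\|_n + \eta\sqrt{\gamma}\right)\sqrt{\frac{s\log p}{n}}.$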

The existence of such a $\tau$ does not mean that it is easy to find. However, one can use $\bar\tau$ at the cost of enlarging the constants by the square root of the condition number of $\Sigma_S = X_S^*X_S$.

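Corollary 3.1. Under the same conditions as Theorem 3.4 and assuming an upper eigenvalue bound $X_S^*X_S \le \gamma_{\max}I_S$, the following holds for all $t \in [\tau, \bar\tau]$ with probability at least $1 - \frac{2}{p\sqrt{\pi\log p}}$,

$\|\beta_t - \beta^*\|_2 \le \frac{2\sigma\sqrt{\mathcal{K}(X_S^*X_S)}}{\gamma\eta}\left(4\max_{j\in T}\|X_j\|_n + \eta\sqrt{\gamma}\right)\sqrt{\frac{s\log p}{n}},$

where $\mathcal{K}(X_S^*X_S) = \gamma_{\max}/\gamma$ is the condition number of $X_S^*X_S$.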

All the results in this subsection follow from the more general results on Linearized Bregman ISS (1.3) in the next section by taking $\kappa \to \infty$. Therefore we omit the proofs.

4. Generalizations to Linearized Bregman ISS and Its Discretization. In this section, we state a general consistency result for Linearized Bregman ISS (1.3) and Linearized Bregman Iterations (1.4) whose proof will be given in the next section.


4.1. Consistency of Linearized Bregman ISS. The following theorem establishes general conditions for statistical consistency of Linearized Bregman ISS (LBISS) (1.3).


Theorem 4.1 (Consistency of LBISS). Let

$\bar\tau := \frac{(1 - B/(\kappa\eta))\eta}{2\sigma}\sqrt{\frac{n}{\log p}}\left(\max_{j\in T}\|X_j\|_n\right)^{-1}.$

Assume $\kappa$ is big enough to satisfy

$B := \beta^*_{\max} + 2\sigma\sqrt{\frac{\log p}{\gamma n}} + \frac{\|X\beta^*\|_2 + 2\sigma\sqrt{s\log n}}{n\sqrt{\gamma}} \le \kappa\eta.$

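Then (1.3) has paths satisfying

A. (No-false-positive) the path has no false positive before time $\bar\tau$, i.e., $\forall t \le \bar\tau$, $\mathrm{supp}(\beta_t) \subseteq S$, with probability at least $1 - \frac{2}{p\sqrt{\pi\log p}} - \frac{1}{n\sqrt{\pi\log n}}$;

B. (No-false-negative for Mean Path) moreover if the signal is strong enough such that $\beta^*_{\min} \ge c_1/\bar\tau$,

$c_1 = \frac{(1 - B/(\kappa\eta))\eta}{\sqrt{\gamma}\,\max_{j\in T}\|X_j\|_n} + \left(1 + \frac{B}{\kappa}\right)\|(X_S^*X_S)^{-1}\|_\infty,$

then with probability at least $1 - \frac{2}{p\sqrt{\pi\log p}} - \frac{1}{n\sqrt{\pi\log n}}$, the mean path $\bar\beta(t)$ satisfies $\mathrm{sign}(\bar\beta_{\bar\tau}) = \mathrm{sign}(\beta^*)$;

C. (Sign-consistency for LBISS) moreover if the smallest magnitude $\beta^*_{\min}$ is strong enough and $\kappa$ is big enough such that

$\beta^*_{\min} \ge \frac{4\sigma}{\gamma^{1/2}}\sqrt{\frac{\log p}{n}}, \qquad \frac{8 + 4\log s}{\beta^*_{\min}\gamma} + \frac{1}{\kappa\gamma}\log\left(\frac{3\|\beta^*\|_2}{\beta^*_{\min}}\right) \le \bar\tau,$

then with probability at least $1 - \frac{2}{p\sqrt{\pi\log p}} - \frac{1}{n\sqrt{\pi\log n}}$, $\mathrm{sign}(\beta_{\bar\tau}) = \mathrm{sign}(\beta^*)$;

D. ($\ell_2$-bound) For some constant $C$ and $\kappa$ large enough to satisfy

$\frac{4}{C\gamma}\sqrt{\frac{n}{\log p}} + \frac{1}{2\kappa\gamma}\left(1 + \log\frac{n\|\beta^*\|_2^2 + 4\sigma^2 s\log p/\gamma}{C^2 s\log p}\right) \le \bar\tau,$

there is a $\tau \in [0, \bar\tau]$ such that $\|\beta_\tau - \beta^*\|_2 \le \left(C + \frac{2\sigma}{\sqrt{\gamma}}\right)\sqrt{\frac{s\log p}{n}}$ with probability at least $1 - \frac{2}{p\sqrt{\pi\log p}} - \frac{1}{n\sqrt{\pi\log n}}$.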

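Remark 4.1. A. For sign-consistency of LBISS,

$\beta^*_{\min} \ge \left(\frac{4\sigma}{\gamma^{1/2}} \vee \frac{8\sigma(2+\log s)\left(\max_{j\in T}\|X_j\|_n\right)}{\gamma\eta}\right)\sqrt{\frac{\log p}{n}}$

is enough to guarantee the existence of $\kappa$.

B. For $\ell_2$-consistency,

$C \ge \frac{8\sigma\left(\max_{j\in T}\|X_j\|_n\right)}{\gamma\eta}$

is enough to guarantee the existence of $\kappa$.

C. Taking $\kappa = \infty$, we get Theorem 3.3 for Bregman ISS.

D. An $\ell_2$-error bound of the same rate for the estimator $\beta(\bar\tau)$ can be established using the monotonicity of $\|X_S(\tilde\beta_S - \beta_S(t))\|_2$ (see Appendix): for $\tau \le \bar\tau$,

$\|\tilde\beta_S - \beta_S(\bar\tau)\|_2 \le \frac{\|X_S(\tilde\beta_S - \beta_S(\bar\tau))\|_2}{\sqrt{n\gamma}} \le \frac{\|X_S(\tilde\beta_S - \beta_S(\tau))\|_2}{\sqrt{n\gamma}} \le \sqrt{\mathcal{K}(X_S^*X_S)}\,\|\tilde\beta_S - \beta_S(\tau)\|_2,$

where $\mathcal{K}(X_S^*X_S)$ is the condition number of $X_S^*X_S$.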

4.2. Consistency of Linearized Bregman iterations. The following theorem establishes statistical consistency conditions for Linearized Bregman Iteration (1.4).


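Theorem 4.2 (Consistency of Linearized Bregman Iterations). Let $t_k = \sum_{i=0}^{k-1}\alpha_i$ and

$\bar\tau := \frac{(1 - B/(\kappa\eta))\eta}{2\sigma}\sqrt{\frac{n}{\log p}}\left(\max_{j\in T}\|X_j\|_n\right)^{-1}.$

Assume that $\kappa$ is big enough to satisfy

$B := \beta^*_{\max} + 2\sigma\sqrt{\frac{\log p}{\gamma n}} + \frac{\|X\beta^*\|_2 + 2\sigma\sqrt{s\log n}}{n\sqrt{\gamma}} \le \kappa\eta,$

and the step size $\alpha$ is small such that $\kappa\alpha\|X_SX_S^*\| < 2$. Then any solution path of (1.4) satisfies

A. (No-false-positive) for all $k$ s.t. $t_k \le \bar\tau$, the path has no false positive, $\mathrm{supp}(\beta_k) \subseteq S$, with probability at least $1 - \frac{2}{p\sqrt{\pi\log p}} - \frac{1}{n\sqrt{\pi\log n}}$;

B. (Sign-consistency) moreover if the smallest magnitude $\beta^*_{\min}$ is strong enough and $\kappa$ is big enough to ensure

$\beta^*_{\min} \ge \frac{4\sigma}{\gamma^{1/2}}\sqrt{\frac{\log p}{n}}, \qquad \frac{8 + 4\log s}{\tilde\gamma\beta^*_{\min}} + \frac{1}{\kappa\tilde\gamma}\log\left(\frac{3\|\beta^*\|_2}{\beta^*_{\min}}\right) + 3\alpha \le \bar\tau,$

where $\tilde\gamma = \gamma(1 - \kappa\alpha\|X_SX_S^*\|/2)$, then with probability at least $1 - \frac{2}{p\sqrt{\pi\log p}} - \frac{1}{n\sqrt{\pi\log n}}$, $\mathrm{sign}(\beta_{k^*}) = \mathrm{sign}(\beta^*)$ for $k^* = \max\{k : t_k \le \bar\tau\}$.

C. ($\ell_2$-bound) for some large enough constants $\kappa$ and $C$ such that

$\frac{4}{C\tilde\gamma}\sqrt{\frac{n}{\log p}} + \frac{1}{2\kappa\tilde\gamma}\left(1 + \log\frac{n\|\beta^*\|_2^2 + 4\sigma^2 s\log p/\gamma}{C^2 s\log p}\right) + 2\alpha \le \bar\tau,$

with probability at least $1 - \frac{2}{p\sqrt{\pi\log p}} - \frac{1}{n\sqrt{\pi\log n}}$, there is a $k^*$, $t_{k^*} \le \bar\tau$, such that

$\|\beta_{k^*} - \beta^*\|_2 \le \left(C + \frac{2\sigma}{\sqrt{\gamma}}\right)\sqrt{\frac{s\log p}{n}}.$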

Remark 4.2. A. Taking $\alpha \to 0$, we have $\tilde\gamma = \gamma$, and Theorem 4.1 for Linearized Bregman ISS follows.

B. The condition $\kappa\alpha\|X_SX_S^*\| < 2$ is necessary to ensure the convergence of the LB algorithm in the noiseless case. This condition also guarantees the monotonic descent of $\|X_S(\beta_{S,k} - \tilde\beta_S)\|$ before $\bar\tau$.
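The iteration analyzed in Theorem 4.2 can be sketched in a few lines. The update below is an assumption based on a forward-Euler step of (1.3) with constant step size $\alpha$, combined with the relation $\beta = \kappa\cdot\mathrm{shrink}(\rho + \beta/\kappa, 1)$ used in the proof of Lemma A.1; the function names and the auxiliary variable `z` (standing for $\rho + \beta/\kappa$) are hypothetical.

```python
def shrink(z, lam=1.0):
    """Soft-thresholding: sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def linearized_bregman(X, y, kappa, alpha, n_iter):
    """Return the list of iterates beta_0, ..., beta_{n_iter}.

    X is a list of n rows (each of length p), y a list of length n.
    z tracks rho + beta / kappa; both z and beta start at zero.
    """
    n, p = len(X), len(X[0])
    z = [0.0] * p
    beta = [0.0] * p
    path = [list(beta)]
    for _ in range(n_iter):
        # Residual y - X beta at the current iterate.
        resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p))
                 for i in range(n)]
        # Forward-Euler step: z <- z + (alpha / n) * X^T (y - X beta).
        for j in range(p):
            z[j] += alpha / n * sum(X[i][j] * resid[i] for i in range(n))
        beta = [kappa * shrink(zj) for zj in z]
        path.append(list(beta))
    return path
```

The sketch does not check the step-size condition $\kappa\alpha\|X_SX_S^*\| < 2$; a caller would need to choose $\alpha$ accordingly, and early stopping along `path` plays the role of regularization.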

5. Analysis of ISS Dynamics. The general idea for analyzing the differential inclusions (1.2) and (1.3) is to associate these dynamics with some potential or Lyapunov functions, which control a fast convergence of solutions to the oracle estimator. The restricted strong convexity condition (A1) suggests that when the solution path $\beta(t)$ evolves in the support set $S$, a suitable choice of potential function should decay exponentially fast, which enables us to estimate the stopping time for reaching sign consistency and small $\ell_2$-error.

The difficulty is that ISS dynamics are differential inclusions, hence we exploit differential inequalities of such a potential function to derive the bounds.


5.1. Potential function. One would like to study the dynamics of the following differential inclusion

(5.1a)  $\dot\rho_t + \frac{1}{\kappa}\dot\beta_t = -X^*X(\beta_t - \tilde\beta^*),$

(5.1b)  $\rho_t \in \partial\|\beta_t\|_1,$

where $\tilde\beta^*$ is the oracle estimator. Assuming the right continuity of solutions and multiplying both sides above by $\beta(t) - \tilde\beta^*$, one obtains a potential or Lyapunov function $\Psi : \mathbb{R}^p \to \mathbb{R}_+$ associated with the dynamics

$\frac{d}{dt}\Psi(\beta(t)) = -\frac{1}{n}\|X(\beta_t - \tilde\beta^*)\|_2^2,$

where

(5.2)  $\Psi(\beta) = D(\tilde\beta^*, \beta) + \frac{1}{2\kappa}\|\beta - \tilde\beta^*\|_2^2,$

and $D(\tilde\beta, \beta)$ is the Bregman distance

(5.3)  $D_V(\tilde\beta, \beta) := V(\tilde\beta) - V(\beta) - \langle\partial V(\beta), \tilde\beta - \beta\rangle$

induced by the particular convex function $V(\beta) = \|\beta\|_1$. As $n \ll p$, the matrix $X$ has a large null space, and to ensure that the stationary point of the dynamics is the oracle solution, one must restrict the dynamics to evolve outside the subspace $\ker(X)$.
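To make the potential concrete, here is a minimal sketch that evaluates the Bregman distance induced by $V(\beta) = \|\beta\|_1$ together with the quadratic term $\|\beta - \tilde\beta\|_2^2/(2\kappa)$. Because a subgradient $\rho \in \partial\|\beta\|_1$ satisfies $\langle\rho, \beta\rangle = \|\beta\|_1$, the distance reduces to $\|\tilde\beta\|_1 - \langle\rho, \tilde\beta\rangle$. The function names are hypothetical, and the caller is assumed to supply a valid subgradient `rho` (it is not unique where $\beta_i = 0$).

```python
def bregman_l1(beta_tilde, rho):
    """D(beta_tilde, beta) for V = l1 norm, given rho in the subdifferential at beta.

    Uses <rho, beta> = ||beta||_1, so D = ||beta_tilde||_1 - <rho, beta_tilde>.
    """
    return sum(abs(b) - r * b for b, r in zip(beta_tilde, rho))


def potential(beta_tilde, beta, rho, kappa):
    """Bregman distance plus ||beta - beta_tilde||_2^2 / (2 kappa)."""
    sq = sum((b - bt) ** 2 for b, bt in zip(beta, beta_tilde))
    return bregman_l1(beta_tilde, rho) + sq / (2.0 * kappa)
```

The Bregman term vanishes exactly when $\rho_i = \mathrm{sign}(\tilde\beta_i)$ on the support of $\tilde\beta$, which is why it is a natural measure of sign agreement.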

5.2. Differential inequality with restricted exponential decay of potential. Define the following Oracle Dynamics, as if an oracle disclosed the true variable set $S$ so that we restrict our attention to the subspace defined by $S$,

(5.4)  $\dot\rho'_S + \frac{1}{\kappa}\dot\beta'_S = -X_S^*X_S(\beta'_S - \tilde\beta_S), \qquad \rho'_S(t) \in \partial\|\beta'_S(t)\|_1.$

Here $X_S^*X_S$ is an $s \times s$ symmetric matrix satisfying the strong convexity $X_S^*X_S \ge \gamma I_S$, which will lead to exponentially fast decay of the potential function.

To reach this goal, our key treatment here is a differential inequality associated with the differential inclusion in the Oracle Dynamics, which is tight enough to ensure the exponential decay of the potential function. This is a Bihari-type [Bih56] nonlinear differential inequality, which generalizes the linear cases of the Gronwall-Bellman inequalities [Gro19, Bel43]. In our treatment, a piecewise continuous bound is given which leads to the tight rates in this paper.

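Lemma 5.1 (Generalized Bihari’s Inequality). The potential $\Psi$ of the Oracle Dynamics above satisfies the following differential inequality

$\frac{d}{dt}\Psi(\beta'_S(t)) \le -\gamma F^{-1}(\Psi(\beta'_S(t))),$

where $F^{-1}$ is the right-continuous inverse of the following strictly increasing function

(5.5)  $F(x) = \frac{x}{2\kappa} + \begin{cases} 0, & 0 \le x < \tilde\beta_{\min}^2, \\ 2x/\tilde\beta_{\min}, & \tilde\beta_{\min}^2 \le x < s\tilde\beta_{\min}^2, \\ 2\sqrt{xs}, & x \ge s\tilde\beta_{\min}^2. \end{cases}$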
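For readers who want to experiment with the inequality numerically, the sketch below evaluates a piecewise function of the form suggested by the proof of Lemma 5.1 (a linear term $x/(2\kappa)$ plus a term that is $0$, then $2x/\tilde\beta_{\min}$, then $2\sqrt{xs}$) and its right-continuous inverse by bisection. The exact form of $F$ is an assumption read off from that proof, and the function names are hypothetical.

```python
import math


def F(x, kappa, beta_min, s):
    """Piecewise increasing function with a jump at x = beta_min**2."""
    base = x / (2.0 * kappa)
    if x < beta_min ** 2:
        return base
    if x < s * beta_min ** 2:
        return base + 2.0 * x / beta_min
    return base + 2.0 * math.sqrt(x * s)


def F_inverse(y, kappa, beta_min, s, iters=200):
    """Right-continuous inverse inf{x >= 0 : F(x) > y}, for y >= 0."""
    lo, hi = 0.0, 1.0
    # Grow the bracket until F(hi) exceeds y.
    while F(hi, kappa, beta_min, s) <= y:
        hi *= 2.0
    # Bisection keeps F(lo) <= y < F(hi) (or lo = 0).
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if F(mid, kappa, beta_min, s) > y:
            hi = mid
        else:
            lo = mid
    return hi
```

On the jump interval the inverse is constant and equal to $\tilde\beta_{\min}^2$, which is what makes the bound piecewise continuous.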
Such an inequality ensures that the potential function decreases fast enough, which leads to the following tight estimates on stopping times.

We are concerned with the following stopping times for reaching sign-consistency and $\ell_2$-consistency of the Oracle Dynamics, respectively. Define

(5.6)  $\tilde\tau_1 := \inf\{t > 0 : \mathrm{sign}(\beta'_S(t)) = \mathrm{sign}(\tilde\beta_S)\},$

(5.7)  $\tilde\tau_2(C) := \inf\left\{t > 0 : \|\beta'_S(t) - \tilde\beta_S\|_2 \le C\sqrt{\frac{s\log p}{n}}\right\}.$

Equipped with the generalized Bihari’s inequality, one can establish the following bounds on the stopping times for sign-consistency and $\ell_2$-consistency, respectively.


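Lemma 5.2. The following bounds hold for the Oracle Dynamics (5.4)

$\tilde\tau_1 \le \frac{4 + 2\log s}{\gamma\tilde\beta_{\min}} + \frac{1}{\kappa\gamma}\log\left(\frac{\|\tilde\beta\|_2}{\tilde\beta_{\min}}\right),$

$\tilde\tau_2(C) \le \frac{4}{C\gamma}\sqrt{\frac{n}{\log p}} + \frac{1}{2\kappa\gamma}\left(1 + \log\frac{n\|\tilde\beta\|_2^2}{C^2 s\log p}\right).$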


Remark 5.1. A. $\tilde\tau_1 \le O(\log s/\tilde\beta_{\min})$ says that $\beta'_S(t)$ reaches sign-consistency after $t \ge O(\log s/\tilde\beta_{\min})$. The factor $\log s$ is due to the potential method above, which converts a multidimensional dynamics into a one-dimensional differential inequality; dropping the potential exponentially from at least $\|\tilde\beta\|_1 \ge s\tilde\beta_{\min}$ to 0 necessarily requires $O(\log s)$ time.

B. $\tilde\tau_2(C) \le O(C^{-1}\sqrt{n/\log p})$ says that $\ell_2$-consistency can be reached before $\bar\tau = O(\sqrt{n/\log p})$ as long as $C$ is a sufficiently large constant.
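The two stopping-time bounds can be evaluated directly, which is convenient for checking when they fall below $\bar\tau$. The sketch below assumes the form $\tilde\tau_1 \le (4 + 2\log s)/(\gamma\tilde\beta_{\min}) + \log(\|\tilde\beta\|_2/\tilde\beta_{\min})/(\kappa\gamma)$ and $\tilde\tau_2(C) \le 4/(C\gamma)\sqrt{n/\log p} + (1 + \log(n\|\tilde\beta\|_2^2/(C^2 s\log p)))/(2\kappa\gamma)$; the function and argument names are hypothetical.

```python
import math


def tau1_bound(s, gamma, kappa, beta_min, beta_l2):
    """Upper bound on the sign-consistency stopping time."""
    main = (4.0 + 2.0 * math.log(s)) / (gamma * beta_min)
    correction = math.log(beta_l2 / beta_min) / (kappa * gamma)
    return main + correction


def tau2_bound(C, s, n, p, gamma, kappa, beta_l2):
    """Upper bound on the l2-consistency stopping time."""
    main = 4.0 / (C * gamma) * math.sqrt(n / math.log(p))
    ratio = n * beta_l2 ** 2 / (C ** 2 * s * math.log(p))
    correction = (1.0 + math.log(ratio)) / (2.0 * kappa * gamma)
    return main + correction
```

Passing `kappa=math.inf` removes the correction terms, which corresponds to the Bregman ISS limit.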


5.3. Sign-consistency and $\ell_2$-error bound. Now we are ready to reach the sign-consistency and $\ell_2$-error bound for $\beta(t)$ by setting $\tilde\tau_1 \le \bar\tau$ and $\tilde\tau_2(C) \le \bar\tau$, respectively. In these cases, the Oracle Dynamics (5.4) $\beta'_S(t)$ meets the original path $\beta_S(t)$ when restricted to $S$. The complete proofs of Theorem 4.1 and of its discrete version, Theorem 4.2, can be found in Appendix A, together with their supporting lemmas.
6. Data-dependent Stopping Rules for Bregman ISS. All the previous results enable us to select $\bar\tau$ as a stopping time, which however depends on the unknown parameters $\gamma$, $\eta$, and the noise level $\sigma$, and hence is not a data-dependent stopping rule. In this section we present two preliminary results with early stopping rules comparable to [CW11], which only depend on the noise level $\sigma$ and thus can be estimated from data. We leave it to future work to explore fully adaptive stopping rules.

In the following, define the residue $r(t) := y - X\beta(t)$. The first theorem adopts the stopping rule based on $\|r(t)\|_2$ and the second theorem is based on $\|X^Tr(t)\|_\infty$.


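Theorem 6.1. Suppose

$\beta^*_{\min} \ge \left(\frac{4\sigma}{\gamma^{1/2}} \vee \frac{8\sigma(2+\log s)\left(\max_{j\in T}\|X_j\|_n\right)}{\gamma\eta}\right)\sqrt{\frac{\log p}{n}}.$

Then Bregman ISS with the stopping rule $\|r(t)\|_2 \le \sigma\sqrt{n + 2\sqrt{n\log n}}$ selects the true subset $S$ with probability at least $1 - O(1/n)$.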
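The residual-based rule is simple to apply to any computed path. The sketch below scans a list of iterates and returns the first index whose residual satisfies $\|y - X\beta\|_2 \le \sigma\sqrt{n + 2\sqrt{n\log n}}$; the function names and the list-of-lists data layout are hypothetical, and $\sigma$ is assumed known or estimated separately.

```python
import math


def residual_norm(X, y, beta):
    """Euclidean norm of the residual y - X beta."""
    total = 0.0
    for i in range(len(y)):
        fitted = sum(X[i][j] * beta[j] for j in range(len(beta)))
        total += (y[i] - fitted) ** 2
    return math.sqrt(total)


def first_stopping_index(X, y, path, sigma):
    """Index of the first iterate meeting the l2 residual rule, or None."""
    n = len(y)
    threshold = sigma * math.sqrt(n + 2.0 * math.sqrt(n * math.log(n)))
    for k, beta in enumerate(path):
        if residual_norm(X, y, beta) <= threshold:
            return k
    return None
```

Returning `None` when no iterate qualifies lets the caller distinguish "path too short" from "stop at index 0".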


Remark 6.1. • This result is comparable to Theorem 7 in [CW11].

• The first condition on the minimum magnitude of the signal ensures the model selection consistency of the Bregman ISS path and thus indicates that one can find some $t$ along the path so that the residual term satisfies $\|r(t)\|_2 \le \sigma\sqrt{n + 2\sqrt{n\log n}}$. Once the path achieves sign consistency, the Bregman ISS must stop.

•  The  second  condition  ^  guaran¬ 

tees  that  one  can  not  stop  earlier  before  Bregman  ISS  achieves  a  full 
recovery.  Note  that  as  n  — >■  00,  one  needs  which  is  a 

constant. 


Theorem  6.2.  In  addition  to  (3.11),  suppose 


O*  ^  2crmaxi  \\Xi\\n^/2{l  +  c)slogp 

Pmin  >  - ^ - ^  ^ 


/log  s 
nj 


Then  Bregman  ISS  with  the  stopping  rule  ||A^r(t)||oo  <  2uY^max7||A7|[k)^ 
(6  >  0)  selects  the  true  subset  S  with  probability  at  least  1  —  0(l/p  +  1/n). 
Remark 6.2. This result is comparable to Theorem 8 in [CW11], though the lower bound $\beta^*_{\min} \ge O(\sigma\sqrt{s\log p/n})$ loses a factor $\sqrt{s}$ here. As $n \to \infty$, the lower bound can be arbitrarily small.


The remainder of this section presents the proofs of the theorems above.


Proof of Theorem 6.1. Lemma 3 in [CW11] or Lemma 5.2 in [CXZ09] shows that with probability at least $1 - 1/n$, $\epsilon$ is essentially $\ell_2$ upper bounded,

$\|\epsilon\|_2 \le \sigma\sqrt{n + 2\sqrt{n\log n}}.$

Hence with the same probability,

$\|r(\tau^*)\|_2 = \|(I - X_S(X_S^TX_S)^{-1}X_S^T)\epsilon\|_2 \le \|\epsilon\|_2 \le \sigma\sqrt{n + 2\sqrt{n\log n}}.$

We have now shown that the Bregman ISS stops once the path achieves sign consistency.
Next  we  are  going  to  show  that  the  algorithm  will  not  stop  whenever  there 
is  some  i  €  S  such  that  f3i{t)  =  0.  By  Lemma  A. 5, 

lirtll  >  ||A5(^^-/3s(t))|| 

>  v^m-i3sm 

>  V^/^min 

>  2a^J  n  +  2y^nlog^ 

provided  that  Note  that 


/log  s 


||(X5X5)  ^X^ejloo  <  2(7  4/ - ,  w.  p.  at  least  1  —  2n 


nj 


so  it  suffices  to  have  /3T,  >  MVn+2V^+VMil_ 


□ 


Proof of Theorem 6.2. By assumptions

$\beta^*_{\min} \ge \left(\frac{4\sigma}{\gamma^{1/2}} \vee \frac{8\sigma(2+\log s)\left(\max_{j\in T}\|X_j\|_n\right)}{\gamma\eta}\right)\sqrt{\frac{\log p}{n}}.$

Hence, according to Theorem 4.2, the Bregman ISS achieves sign consistency with high probability. Assume that at time $\tau^*$, $\beta(\tau^*)$ has the same sign as the underlying sparse signal $\beta^*$. For each $t$,

$r_t = (I - X_{S(t)}(X^*_{S(t)}X_{S(t)})^{-1}X^*_{S(t)})(X_S\beta^*_S + \epsilon) = s_t + n_t,$

where $s_t = (I - P_{S(t)})X_S\beta^*_S$ is the signal part of the residual and $n_t = (I - P_{S(t)})\epsilon$ is the noise part of the residual. Then $r_{\tau^*} = n_{\tau^*}$. Let $b_\infty = \sigma\sqrt{2(1+c)\log p}\,\max_j\|X_j\|$.

Prob(||X^nt||oo  =  ||X^(/-Pt)e||oo  >  600)  <  J]]  Prob(|Xf  (/ -  >  ^00 

i 

<  ^Prob(|Xfe|  >  600) 

i 

^  2 

“  p^\/2  logp’ 

which  means  the  algorithm  stops  at  t*  . 

Next  we  are  going  to  show  that  the  algorithm  will  not  stop  whenever  there 
is  some  i  €  At  S  such  that  (3i{t)  =  0.  By  Lemma  A. 5, 


Wx^nWoo  =  \\x^[XsW*s-Ps{t))  +  {i-Psmoo, 

>  \\X^s[XsW*s-Ps{t))  +  {I-Ps)e]\U 

=  \\XlXs0*s-Ps{t))U,  xUl-Ps)e  =  0, 

>  ^\\X^Xs0*s-f3sm\2, 

y/S 

>  ^Il/9|-/Ss(t)ll2, 

VS 

>  >b 

—  /-yrain  —  "oo; 

ys 

provided  that  •  Note  that 


||(X^X5)-1a^£|U<2^7 
so  it  suffices  to  have 
t^rain  —  T  2(7 


/log  S 


,  w.  p.  at  least  1  —  2n 


log  s  cr(maxj  y^2(l  +  c)s  logp/7  +  ^/log  s) 


ny 


with  probability  at  least  1  —  0{p  ^  +  n 


□ 


7. Experiments. In this section we provide some experimental results to illustrate the relations among LASSO, Bregman ISS (ISS) and Linearized Bregman iteration (LB).

In this experiment we choose $n = 200$, $p = 100$ and only the first $s = 30$ elements of $\beta$ are nonzero ($\beta_j = r_j + \mathrm{sign}(r_j)$, where $r_j \sim \mathcal{N}(0, 1)$, $j = 1, \ldots, 30$). Each sample $x_i$ is drawn from the distribution $\mathcal{N}(0, \Sigma_p)$. We choose $\Sigma_p = (\sigma_{ij})$, where $\sigma_{ij} = 1$ if $i = j$ and $\sigma_{ij} = 1/(3p)$ otherwise. In such a setting, the Irrepresentable (Incoherence) Condition holds with high probability, since $\Sigma_p$ is nearly the identity matrix. We choose noise level $\sigma = 1$ here, considering that the magnitude of $\beta_i$ is $O(1)$.

Figure 1 is an example of the regularization paths of the three methods. As $\kappa$ grows, the LB path becomes closer to that of ISS. For LB we choose $\kappa\alpha = 1/2$ such that the step size of gradient descent is 1/2, to satisfy the convergence condition. Note that if $\alpha$ is too big, the solution oscillates.

To compare the performance of the three methods quantitatively, we use the AUC of the ROC curve to measure the quality of the three regularization paths. The ROC (receiver-operating-characteristic) curve is plotted by thresholding the regularization parameter $\lambda$ in LASSO, $t$ in ISS, or $k$ in LB at different levels, which create different true positive rates (TPR) and false positive rates (FPR):

$\mathrm{TPR} = \frac{\#\{\text{Selected True Variables}\}}{\#\{\text{True Variables}\}}, \qquad \mathrm{FPR} = \frac{\#\{\text{Selected False Variables}\}}{\#\{\text{False Variables}\}}.$

ROC is a curve from (0, 0) to (1, 1). AUC (Area Under the Curve) means the area under the ROC curve. Large AUC values indicate that the signals are picked out earlier than noise on the regularization paths. Repeating the experiments 100 times, in Table 1 we report the mean AUC with standard deviations for the three methods at different noise levels. It shows that all three methods work reasonably well in this example, while Bregman ISS performs slightly better than LASSO. As $\kappa$ becomes bigger, the performance of LB gets closer to that of Bregman ISS. Notice that as the noise level $\sigma$ gets larger, the performance of all the methods decays since signal and noise get confused.
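One simple way to compute such an AUC from a regularization path is sketched below. It is an assumption, not necessarily the procedure used for Table 1: each variable is scored by the time at which it first enters the path, and the AUC is computed through the pairwise (Mann-Whitney) form, i.e. the fraction of (true, false) variable pairs in which the true variable enters first, with ties counted as one half. The function name is hypothetical.

```python
def path_auc(entry_times, is_true):
    """AUC from first-entry times; smaller time means selected earlier.

    Variables that never enter the path can be given math.inf.
    """
    pos = [t for t, flag in zip(entry_times, is_true) if flag]
    neg = [t for t, flag in zip(entry_times, is_true) if not flag]
    if not pos or not neg:
        raise ValueError("need at least one true and one false variable")
    wins = 0.0
    for tp in pos:
        for tn in neg:
            if tp < tn:
                wins += 1.0      # true variable enters first
            elif tp == tn:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))
```

An AUC of 1 means every true variable enters the path before any false one, which is the situation described by the sign-consistency theorems.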


σ    LB(κ = 4)         LB(κ = 64)        LB(κ = 1024)      ISS               LASSO
1    0.9771(0.0124)    0.994(0.0069)     0.9947(0.0065)    0.9948(0.0064)    0.9945(0.0068)
3    0.9604(0.0169)    0.9867(0.009)     0.9882(0.0083)    0.9884(0.0082)    0.9879(0.0086)
5    0.9393(0.0226)    0.9659(0.0188)    0.9673(0.0188)    0.9676(0.0187)    0.9671(0.0187)

Table 1
Mean AUC (standard deviation) for three methods at different noise levels (σ): ISS has a slightly better performance than LASSO in terms of AUC and as κ increases, the performance of LB approaches that of ISS. As noise level σ increases, the performance of all the methods drops.
[Figure 1 panels: Lasso, ISS]

Fig 1. Regularization path of LASSO, Bregman ISS, and Linearized Bregman Iterations with different choices of κ (κα = 1/2). As κ grows, the paths of Linearized Bregman iterations approach that of Bregman ISS.


8. Discussion and Conclusion. In this paper, noisy sparse signal recovery is approached via two continuous dynamics, called Bregman ISS and Linearized Bregman ISS, where a discretization of the latter leads to the widely used Linearized Bregman Iteration algorithm. Equipped with an early stopping regularization, Bregman ISS can simultaneously achieve model selection consistency and unbiased estimation. As a discretization of Linearized Bregman ISS paths, model selection consistency and minimax optimal $\ell_2$-error bounds for Linearized Bregman Iteration are also established. Some data-dependent stopping rules are given for Bregman ISS solution paths.

Future directions of our study include fully data-dependent stopping rules and generalization of our results to nonlinear settings.

APPENDIX A: PROOFS

A.1. Proof of Consistency of LBISS.

Lemma A.1. Assume that $X_S$ has full column rank.

(A) For all $t \le \tau$, the solution $\beta(t)$ of (1.3) contains no false positive if

$\|X_T^*X_S^{\dagger}(\rho_S + \beta_S/\kappa) + tX_T^*P_T\epsilon\|_\infty < 1, \quad \forall t \le \tau,$

where $P_T = I - X_S^{\dagger}X_S^*$ is the projection operator onto the column space of $X_T$.

(B) Mean path $\bar\beta(\tau)$ is sign-consistent if

$\mathrm{sign}(\bar\beta_S(\tau)) = \mathrm{sign}\left(\tilde\beta_S - \frac{1}{\tau}\Phi_S^{-1}\left(\rho_S(\tau) + \frac{1}{\kappa}\beta_S(\tau)\right)\right) = \mathrm{sign}(\beta^*_S),$

where $\Phi_S = X_S^*X_S$.

No-false-positivity and the sign-consistency for the mean path in Theorem 4.1 follow directly from this lemma.

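Proof of Lemma A.1. Consider the differential inclusion (1.3)

$\dot\rho + \frac{1}{\kappa}\dot\beta = -\frac{1}{n}X^T(X\beta - y) = -X^*X(\beta - \beta^*) + X^*\epsilon.$

Assume there exists a $\tau > 0$ such that for all $t \le \tau$, the solution path $\beta(t)$ contains no false positive, i.e. $\mathrm{supp}(\beta(t)) \subseteq S$. Then for all $t \le \tau$,

(A.1)  $\dot\rho_S + \dot\beta_S/\kappa = -X_S^*X_S(\beta_S - \beta^*_S) + X_S^*\epsilon,$

and

(A.2)  $\dot\rho_T + \dot\beta_T/\kappa = -X_T^*X_S(\beta_S - \beta^*_S) + X_T^*\epsilon.$

From (A.1) one gets $-(\beta_S - \beta^*_S) = (X_S^*X_S)^{-1}(\dot\rho_S + \dot\beta_S/\kappa) - (X_S^*X_S)^{-1}X_S^*\epsilon$, which leads to the following equation by plugging into (A.2)

$\dot\rho_T + \dot\beta_T/\kappa = X_T^*X_S^{\dagger}(\dot\rho_S + \dot\beta_S/\kappa) + X_T^*P_T\epsilon$
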



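where $P_T = I - P_S = I - X_S^{\dagger}X_S^*$ is the projection matrix onto $\mathrm{im}(X_T)$.

Integrating both sides and setting

$\|\rho_T(t) + \beta_T(t)/\kappa\|_\infty = \|X_T^*X_S^{\dagger}(\rho_S(t) + \beta_S(t)/\kappa) + tX_T^*P_T\epsilon\|_\infty < 1,$

the first part follows from $\beta_T(t) = \kappa\cdot\mathrm{shrink}(\rho_T(t) + \beta_T(t)/\kappa, 1)$.

The second part is obtained by integrating (A.1),

$\bar\beta_S(\tau) = \frac{1}{\tau}\int_0^\tau\beta_S(t)\,dt = \beta^*_S - \frac{1}{\tau}\Phi_S^{-1}\left(\rho_S(\tau) + \frac{1}{\kappa}\beta_S(\tau)\right) + \Phi_S^{-1}X_S^*\epsilon,$

followed by taking $\mathrm{sign}(\bar\beta_S(\tau)) = \mathrm{sign}(\beta^*_S)$.  □

Lemma A.2. Suppose $\epsilon \sim \mathcal{N}(0, \sigma^2I_n)$ and $X \in \mathbb{R}^{n\times p}$. Then

(A.3)  $\mathrm{Prob}\left(\|X^T\epsilon\|_\infty \ge \sigma\sqrt{2(1+\mu)\log p}\,\max_j\|X_j\|\right) \le \frac{1}{p^{\mu}\sqrt{\pi\log p}},$

(A.4)  $\mathrm{Prob}\left(\|X^T\epsilon\|_2 \ge \sigma\sqrt{2(1+\mu)\,\mathrm{tr}(X^TX)\log p}\right) \le \frac{1}{p^{\mu}\sqrt{\pi\log p}}.$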

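Proof of Lemma A.2. From the Gaussian tail probability bound,

$\mathrm{Prob}\left(|X_j^T\epsilon| \ge \sigma\sqrt{2(1+\mu)\log p}\,\|X_j\|\right) \le 2\cdot\frac{1}{\sqrt{2\pi}\sqrt{2(1+\mu)\log p}}\,e^{-\frac{2(1+\mu)\log p}{2}} \le \frac{1}{p^{1+\mu}\sqrt{\pi\log p}}.$

The first inequality follows directly from the union bound over the index $j$. The second inequality is obtained from the fact

$\left\{\epsilon : \|X^T\epsilon\|_2 \ge \sigma\sqrt{2(1+\mu)\,\mathrm{tr}(X^TX)\log p}\right\} \subseteq \bigcup_j\left\{\epsilon : |X_j^T\epsilon| \ge \sigma\sqrt{2(1+\mu)\log p}\,\|X_j\|\right\},$

which ends the proof.  □

Proof of Lemma 5.1. Denote

$A_t = \{i \in S \mid \mathrm{sign}(\tilde\beta_i) \ne \mathrm{sign}(\beta'_i)\} \subseteq S.$

Notice that

$\|\tilde\beta_S - \beta'_S\|_2^2 \ge \sum_{i\in A_t}|\tilde\beta_i|^2 \ge \max\left\{\tilde\beta_{\min}\sum_{i\in A_t}|\tilde\beta_i|,\ \frac{1}{s}\Big(\sum_{i\in A_t}|\tilde\beta_i|\Big)^2\right\} \ge \max\left\{\tilde\beta_{\min}D(\tilde\beta_S, \beta'_S)/2,\ D(\tilde\beta_S, \beta'_S)^2/(4s)\right\}$

and

$\|\tilde\beta_S - \beta'_S\|_2^2 < \tilde\beta_{\min}^2 \ \Rightarrow\ D(\tilde\beta_S, \beta'_S) = 0.$

Then according to the definition of $\Psi$ and $F$, we have

$\Psi(\beta'_S) = \frac{\|\tilde\beta_S - \beta'_S\|_2^2}{2\kappa} + D(\tilde\beta_S, \beta'_S) \le F(\|\tilde\beta_S - \beta'_S\|_2^2),$

which implies

$F^{-1}(\Psi(\beta'_S)) \le \|\tilde\beta_S - \beta'_S\|_2^2.$

Combining the right continuous differentiability of $\Psi$ with the strong convexity condition of $X_S^*X_S$, we have

$\frac{d}{dt}\Psi(\beta'_S) = -\langle\beta'_S - \tilde\beta_S, X_S^*X_S(\beta'_S - \tilde\beta_S)\rangle \le -\gamma\|\tilde\beta_S - \beta'_S\|_2^2 \le -\gamma F^{-1}(\Psi(\beta'_S)),$

as desired.  □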


Proof  of  Lemma  5.2.  From  the  generalized  Bihari’s  inequality 

~  _  r  iti^iP's))  .  1  dx 

^  Jo  7  A(ioo) 

Note  that  'I'(O)  =  H^JHi  +  so  F“^('h(0))  <  By  continuity 

and  monotonicity  of  F{x)  on  {pF^,+oo)  and  and  'l'(ti)  > 


7n  < 


< 


. 

,  LJZlML  _L0 
9k  '^FTt 


P  . 

rmn 


2k 

<32  - 

rmn  1  o  n 
9k  '^P'rr 


dx 

F-Hx)^ 


r^dLn  dF  r 
7/32  .  X  J 

'^min  ' 


Hi  dF 


g2  X 

rmn 


< 


2k 

4  +  2  log  s 
Pmin 


dx 

~2 

min 


A 


+ 


7/32  . 

rmn 


(+^  +  -- 

2kx 


r\\h2  1 

-)dx+  /  {—  +  ^)dx 

y  .  ^2kx 


X2 


+  llo6(|^). 

Pmin 


Proof  of  f2  is  straightforward  now.  For  t  <  f2, 
Let


P{x)  =  - — h  2\/xs  >  F{x)  Vx  >  0 
2k 


Let  F  ^  be  the  right-continuous  inverse.  Then  \\/3{t)  —  >  F  ^(T(/?))  > 

F“^('L(/3)).  By  generalized  Bihari’s  inequality 


Therefore 


Again,  we  have: 


1  dT  1,^,  C^slogd, 


1  r^{0) 
f 2  <  - 


dx 


Noticed  that 
Therefore, 


F-'('L(0))<F-^(T(0))<|/3|i 


rF{C^slogd/n)  /■'I-(O) 

7^2  <  /  + 


/ 


dx 


JF(C^slogd/n)  F-^{x) 


< 


'■(C^slogd/n)/2K-|-2CsydogT7L 


s  log  d 


+ 


L 


dF 


C^slogd/n  ^ 


1  2  Hn 


+ 


(tt:  + 


2k  C  log  d  J C‘^s\ogd/n  2^x 


4 

< 


n  1  ,  n 

+  —  (1  +  log 


C^slogd 


Cy  logd  2k 
which  gives  the  bounds. 

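Proof of Theorem 4.1. Define the events

$A = \left\{\epsilon : \|X_S(X_S^*X_S)^{-1}X_S^*\epsilon\|_2 \ge 2\sigma\sqrt{s\log n}\right\},$

$B = \left\{\epsilon : \|(X_S^*X_S)^{-1}X_S^*\epsilon\|_\infty \ge 2\sigma\sqrt{\frac{\log p}{\gamma n}}\right\},$

$C = \left\{\epsilon : \|X_T^*P_T\epsilon\|_\infty \ge 2\sigma\sqrt{\frac{\log p}{n}}\max_{j\in T}\|X_j\|_n\right\}.$

Note that $\mathrm{tr}(X_S(X_S^*X_S)^{-1}X_S^*) = s$, $(X_S^*X_S)^{-1}X_S^*\cdot X_S(X_S^*X_S)^{-1} = (X_S^*X_S)^{-1} \preceq \gamma^{-1}I$, and $X_T^*P_T\cdot P_TX_T \preceq X_T^*X_T$. Using Lemma A.2, we have

$\mathrm{Prob}(A) \le \frac{1}{n\sqrt{\pi\log n}}, \qquad \mathrm{Prob}(B) \le \frac{1}{p\sqrt{\pi\log p}}, \qquad \mathrm{Prob}(C) \le \frac{1}{p\sqrt{\pi\log p}}.$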

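(1) (no-false-positivity for $\beta(t)$ up to $\bar\tau$) First consider the LB-ISS

(A.5)  $\dot\rho_S + \frac{1}{\kappa}\dot\beta_S = -X_S^*X_S(\beta_S - \tilde\beta_S),$

where $\tilde\beta_S = \beta^*_S + (X_S^*X_S)^{-1}X_S^*\epsilon$. It is easy to conclude that $\|X_S(\beta_S - \tilde\beta_S)\|_2$ is monotonically decreasing based on the following observation

$\frac{d}{dt}\frac{\|X_S(\beta_S - \tilde\beta_S)\|_2^2}{2n} = -\left\langle\frac{d\rho_S}{dt} + \frac{1}{\kappa}\frac{d\beta_S}{dt},\ \frac{d\beta_S}{dt}\right\rangle = -\frac{1}{\kappa}\left\|\frac{d\beta_S}{dt}\right\|_2^2 \le 0,$

using $\langle d\rho_S(t)/dt,\ d\beta_S(t)/dt\rangle = 0$ from the assumption of Bregman ISS paths. On the set $A^c \cap B^c$,
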


^  Pmax 
P  Pmax 

<  0* 

—  r'max 


+  W^S  -  /3sit)\\2 
\\Xs{0S  -  0sm\2 


\\Xs0sh 


2(Ti 


'logp  \\Xs0*\\2  +  2a^s\ogn 
yn 


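Denote this upper bound as $B$. Returning to the original problem, by Lemma A.1, it suffices to have for all $t \le \bar\tau$,

$1 > \|X_T^*X_S^{\dagger}(\rho_S + \beta_S/\kappa) + tX_T^*P_T\epsilon\|_\infty.$

The first part satisfies

$\|X_T^*X_S^{\dagger}(\rho_S + \beta_S/\kappa)\|_\infty \le (1 - \eta)(1 + B/\kappa) \le 1 - (1 - B/(\kappa\eta))\eta.$

For the second part, take

$t \le \bar\tau := \frac{(1 - B/(\kappa\eta))\eta}{2\sigma}\sqrt{\frac{n}{\log p}}\left(\max_{j\in T}\|X_j\|_n\right)^{-1} = O(\eta\sigma^{-1}\sqrt{n/\log p}).$

On the set $C^c$, we have $t\|X_T^*P_T\epsilon\|_\infty < (1 - B/(\kappa\eta))\eta$.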
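(2) (no-false-negativity for the mean path) it suffices to ensure

$\beta^*_{\min} > \|\Phi_S^{-1}X_S^*\epsilon\|_\infty + \frac{1}{\bar\tau}\|\Phi_S^{-1}(\rho_S + \beta_S/\kappa)\|_\infty,$

where $\Phi_S = X_S^*X_S$. The second part on the right hand side satisfies $\frac{1}{\bar\tau}\|\Phi_S^{-1}(\rho_S + \beta_S/\kappa)\|_\infty \le \frac{1}{\bar\tau}\|\Phi_S^{-1}\|_\infty(1 + B/\kappa)$. The first part is bounded on the set $B^c$.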

(3)  (f2-error  bound)  Lemma  5.2  implies  if  C  >  l|AjlU)  ^  ^ 

is  big  enough,  we  have 


i2iC) 


< 

< 

< 


4  l~n 

C'h  V 

4  rxi 

C7  V 


1 

2^7 


(1  +  log 


1 

2^7 


(1  +  log 


(J2g2  logp 

n||/3*||2  +  4ij^slogp/7 
C‘^s  logp 


Thus $\exists\tau \in [0, \bar\tau]$,

$\|\beta_S(\tau) - \tilde\beta_S\|_2 \le C\sqrt{s\log(p)/n}.$

Note that with high probability

$\|\tilde\beta_S - \beta^*_S\|_2 \le 2\sigma\sqrt{s\log(p)/n}\,\gamma^{-1/2}.$

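(4) (Sign Consistency for $\beta_t$) The condition

$\beta^*_{\min} \ge \frac{4\sigma}{\gamma^{1/2}}\sqrt{\frac{\log p}{n}}$

implies that $\tilde\beta$ has the same sign as $\beta^*$, as well as $\frac{1}{2}|\beta^*_i| \le |\tilde\beta_i| \le \frac{3}{2}|\beta^*_i|$ for each component $i$. Thus sign consistency is reached when $\tilde\tau_1 \le \bar\tau$, or

$\frac{4 + 2\log s}{\gamma\tilde\beta_{\min}} + \frac{1}{\kappa\gamma}\log\left(\frac{\|\tilde\beta\|_2}{\tilde\beta_{\min}}\right) \le \frac{8 + 4\log s}{\gamma\beta^*_{\min}} + \frac{1}{\kappa\gamma}\log\left(\frac{3\|\beta^*\|_2}{\beta^*_{\min}}\right) \le \bar\tau,$

which is ensured by $\kappa$ big enough and

$\beta^*_{\min} \ge \left(\frac{4\sigma}{\gamma^{1/2}} \vee \frac{8\sigma(2+\log s)\left(\max_{j\in T}\|X_j\|_n\right)}{\gamma\eta}\right)\sqrt{\frac{\log p}{n}}.$

This completes the proof.  □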
A.2. Proof of Consistency of Linearized Bregman Iterations.

First of all, we give a discrete version of the generalized Bihari’s inequality, which is useful for Linearized Bregman iterations (1.4).

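Lemma A.3 (Discrete Generalized Bihari’s inequality). Consider the LB

$(\rho_{k+1} - \rho_k) + (\beta_{k+1} - \beta_k)/\kappa = -\alpha_kX_S^*X_S(\beta_k - \tilde\beta),$

where $X_S^*X_S \ge \gamma I$. Let the potential (or Lyapunov) function be

$\Psi_k = D(\tilde\beta, \beta_k) + \frac{\|\beta_k - \tilde\beta\|_2^2}{2\kappa}.$

Then the following difference inequality holds

$\Psi_{k+1} - \Psi_k \le -\alpha_k\gamma(1 - \kappa\alpha_k\|X_SX_S^*\|/2)F^{-1}(\Psi_k),$

where $F$ is defined by (5.5).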


Proof  of  Lemma  A. 3.  Similar  to  continue  case,  we  have 

Wk-Ml>F-\^k). 


Since  fi-norm  is  homogeneous  of  degree  1,  its  subgradient  p  G  <9||/3||i 
satisfies  (p, /3)  =  ||/3||i.  Multiplying  —  /?  on  the  both  sides  of  iteration 
equation,  it  leads  to 


'^k+i-^k  +  {Pk+i-Pk)Pk-\\h+i-Pk\?l‘^^^  =  {h  -  P,X*sXs{(3k  -  /3)) 


Note  that  for  i  G  5,  ^  0 


||/?fc+i  —  /  K  —  2(pfc+i  —  pk)(3k 

<  ||/3a:+1  —  AII^/r  +  2(pfc+i  —  pk){(3k+l  —  Pk) 

Pi  ||/3fc+l  Pk\\  F  2(pfc_|_i  pk)iyPk+l  Pk)  F  ||pfc+l  Pk 

<  i^\\pk+i- PkF{Pk+i- Pk)/i<‘\? 

=  Kal\\X*sXs{Pk-~P)f 


^k+i  -  < 


< 

< 

< 


(^XsiPk  -  P),Xs{Pk  -  ^))  +  ll?  {^sXsiPk  -  P), 
(^XsiPk  -  P),  {I  -  KakXsX*s/2)XsiPk  -  P)) 
_^(1  _  Kak\\XsXM/2))\\Xs{Pk  - 
-ak7{l-Kak\\XsX*s\\/2))\\Pk-M^ 

-akj{l  -  Kak\\XsX*s\\/2))F-\^k) 


which  gives  the  result. 


□ 
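Next we present a discrete stopping time bound from the inequality above.

Lemma A.4 (Discrete Stopping Time Bounds). Consider the LB

$(\rho_{k+1} - \rho_k) + (\beta_{k+1} - \beta_k)/\kappa = -\alpha_kX_S^*X_S(\beta_k - \tilde\beta),$

where $X_S^*X_S \ge \gamma I$ and $\alpha_k \le \alpha$, $\forall k$.

Define

$\tilde\tau_1 := \inf\left\{t_k = \sum_{i=0}^{k-1}\alpha_i : \mathrm{sign}(\beta_k) = \mathrm{sign}(\tilde\beta)\right\}$

and

$\tilde\tau_2(C) := \inf\left\{t_k = \sum_{i=0}^{k-1}\alpha_i : \|\beta_k - \tilde\beta\|_2 \le C\sqrt{\frac{s\log p}{n}}\right\}.$

Then the following bounds hold,

$\tilde\tau_1 \le \frac{4 + 2\log s}{\tilde\gamma\tilde\beta_{\min}} + \frac{1}{\kappa\tilde\gamma}\log\left(\frac{\|\tilde\beta\|_2}{\tilde\beta_{\min}}\right) + 3\alpha,$

$\tilde\tau_2(C) \le \frac{4}{C\tilde\gamma}\sqrt{\frac{n}{\log p}} + \frac{1}{2\kappa\tilde\gamma}\left(1 + \log\frac{n\|\tilde\beta\|_2^2}{C^2s\log p}\right) + 2\alpha,$

where $\tilde\gamma = \gamma(1 - \kappa\alpha\|X_SX_S^*\|/2)$.


Remark A.1. Taking $\alpha \to 0$, it recovers the stopping time bounds in the continuous case, Lemma 5.2.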


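Proof of Lemma A.4. Consider

\[
\Psi_k = D(\tilde\beta, \beta_k) + \frac{\|\beta_k - \tilde\beta\|^2}{2\kappa}.
\]

For a uniform upper bound on step sizes $\alpha_k \le \alpha$, by the discrete Bihari's
inequality in Lemma A.3,

\[
\Psi_{k+1} - \Psi_k \le -\alpha_k \tilde\gamma F^{-1}(\Psi_k) \le -\alpha_k \tilde\gamma \tilde F^{-1}(\Psi_k),
\]

where $\tilde\gamma = \gamma(1 - \kappa\alpha\|X_S X_S^*\|/2)$ and $\tilde F(x) = \frac{x}{2\kappa} + 2\sqrt{xs} \ge F(x)$, $\forall x \ge 0$.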

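For $k$ such that $\Psi_k \ge 2\tilde\beta_{\min} + \tilde\beta_{\min}^2/(2\kappa)$, denote $L_k = F^{-1}(\Psi_k)$, which is
non-increasing. Define $t_m = \sum_{i=0}^{m-1} \alpha_i$. Let $n_1 = \sup\{n : L_n \ge s\tilde\beta_{\min}^2\}$; then

\[
\tilde\gamma \alpha_k \le \frac{F(L_k) - F(L_{k+1})}{L_k},
\]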
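then for $0 \le k \le n_1 - 1$,

\[
\frac{F(L_k) - F(L_{k+1})}{L_k}
\le \frac{\log L_k}{2\kappa} - 2\sqrt{\frac{s}{L_k}} - \frac{\log L_{k+1}}{2\kappa} + 2\sqrt{\frac{s}{L_{k+1}}}.
\]

This is because

\[
\frac{L_k - L_{k+1}}{L_k} \le \log\Big(\frac{L_k}{L_{k+1}}\Big),
\]

using $1 - x \le -\log x$ for $0 < x \le 1$, and

\[
\frac{\sqrt{L_k} - \sqrt{L_{k+1}}}{L_k} \le \frac{\sqrt{L_k} - \sqrt{L_{k+1}}}{\sqrt{L_k}\sqrt{L_{k+1}}}
= \frac{1}{\sqrt{L_{k+1}}} - \frac{1}{\sqrt{L_k}}.
\]

Summing over $0 \le k \le n_1 - 1$ gives

\[
\tilde\gamma t_{n_1} \le \frac{\log L_0}{2\kappa} - 2\sqrt{\frac{s}{L_0}} - \frac{\log L_{n_1}}{2\kappa} + 2\sqrt{\frac{s}{L_{n_1}}}
\le \frac{1}{2\kappa} \log\frac{\|\tilde\beta\|_2^2}{s\tilde\beta_{\min}^2} + \frac{2}{\tilde\beta_{\min}}.
\]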


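Let $n_2 = \sup\{n : L_n \ge \tilde\beta_{\min}^2\}$, so that

\[
\tilde\gamma \alpha_k \le \frac{F(L_k) - F(L_{k+1})}{L_k}.
\]

Then similarly, we have

\[
\begin{aligned}
\tilde\gamma (t_{n_2} - t_{n_1+1})
&\le \Big(\frac{1}{2\kappa} + \frac{2}{\tilde\beta_{\min}}\Big)(\log L_{n_1+1} - \log L_{n_2}) \\
&\le \Big(\frac{1}{2\kappa} + \frac{2}{\tilde\beta_{\min}}\Big)\big(\log(s\tilde\beta_{\min}^2) - \log \tilde\beta_{\min}^2\big).
\end{aligned}
\]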

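Let $n_3 = \sup\{n : \Psi_n \ge \tilde\beta_{\min}^2/(2\kappa)\}$. Then

\[
\tilde\gamma (t_{n_3} - t_{n_2+1}) \le \sum_{k=n_2+1}^{n_3-1} \frac{\Psi_k - \Psi_{k+1}}{\tilde\beta_{\min}^2}
\le \frac{\frac{\tilde\beta_{\min}^2}{2\kappa} + 2\tilde\beta_{\min} - \frac{\tilde\beta_{\min}^2}{2\kappa}}{\tilde\beta_{\min}^2}
= \frac{2}{\tilde\beta_{\min}}.
\]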
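To sum up, we have

\[
\tilde\tau_1 \le t_{n_3+1} \le \frac{4 + 2\log s}{\tilde\gamma \tilde\beta_{\min}}
  + \frac{1}{\kappa\tilde\gamma} \log\Big(\frac{\|\tilde\beta\|_2}{\tilde\beta_{\min}}\Big) + 3\alpha.
\]

Similarly, we have

\[
\tilde\tau_2(C) \le \frac{4}{C\tilde\gamma} \sqrt{\frac{n}{\log p}}
  + \frac{1}{2\kappa\tilde\gamma} \Big(1 + \log \frac{n\|\tilde\beta\|_2^2}{C^2 s \log p}\Big) + 2\alpha,
\]

which ends the proof.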


□ 


Proof of Theorem 4.2. The proof is the same as in the continuous case.
The only difference is that the decrease of $\|X_S(\beta_k - \tilde\beta)\|_2$ requires the condition
$\kappa\alpha\|X_S X_S^*\| < 2$.

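Consider the LB

\[
(\rho_{k+1} - \rho_k) + (\beta_{k+1} - \beta_k)/\kappa = -\alpha_k X_S^* X_S (\beta_k - \tilde\beta),
\]

where $X_S^* X_S \ge \gamma I$. Then

\[
\begin{aligned}
&\|X_S(\beta_{k+1} - \tilde\beta)\|^2 - \|X_S(\beta_k - \tilde\beta)\|^2 \\
&\quad= \|X_S(\beta_{k+1} - \beta_k)\|^2 + 2(\beta_{k+1} - \beta_k)^T X_S^T X_S (\beta_k - \tilde\beta) \\
&\quad= \|X_S(\beta_{k+1} - \beta_k)\|^2 - \frac{2n}{\alpha_k}(\beta_{k+1} - \beta_k)^T\big[(\rho_{k+1} - \rho_k) + (\beta_{k+1} - \beta_k)/\kappa\big] \\
&\quad\le \|X_S(\beta_{k+1} - \beta_k)\|^2 - \frac{2n}{\alpha_k}(\beta_{k+1} - \beta_k)^T(\beta_{k+1} - \beta_k)/\kappa \\
&\quad= n(\beta_{k+1} - \beta_k)^T\Big(X_S^* X_S - \frac{2}{\alpha_k\kappa} I\Big)(\beta_{k+1} - \beta_k) \\
&\quad\le 0,
\end{aligned}
\]

where we have used $X_S^T X_S = n X_S^* X_S$ and $\|X_S X_S^*\| = \|X_S^* X_S\|$. Hence $\|X_S(\tilde\beta - \beta_k)\|_2$ is
monotonically nonincreasing. □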
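
As a sanity check on the step-size condition, the monotone decrease of $\|X_S(\beta_k - \tilde\beta)\|^2$ can be observed on a small synthetic instance. The sketch below is illustrative only: it assumes a random Gaussian design in place of $X_S$, the convention $X^* = X^T/n$, and a constant step size with $\kappa\alpha\|X_S^* X_S\| = c < 2$. The function name `restricted_residuals` and the parameter `c` are hypothetical.

```python
import numpy as np

def restricted_residuals(n=60, s=5, kappa=10.0, c=1.9, steps=200, seed=1):
    """Return ||X_S (beta_k - beta_tilde)||^2 along a Linearized Bregman run.

    The constant step size satisfies kappa * alpha * ||X_S^* X_S|| = c.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, s))            # stands in for X_S (assumption)
    G = X.T @ X / n                            # X_S^* X_S, assuming X^* = X^T / n
    alpha = c / (kappa * np.linalg.eigvalsh(G)[-1])
    beta_tilde = rng.standard_normal(s)

    z = np.zeros(s)                            # z_k = rho_k + beta_k / kappa
    beta = np.zeros(s)
    values = []
    for _ in range(steps):
        values.append(np.sum((X @ (beta - beta_tilde)) ** 2))
        z = z - alpha * G @ (beta - beta_tilde)
        # beta_{k+1} = kappa * soft-threshold(z_{k+1}) with unit threshold
        beta = kappa * np.sign(z) * np.maximum(np.abs(z) - 1.0, 0.0)
    return np.array(values)

r = restricted_residuals()
print("nonincreasing:", bool(np.all(np.diff(r) <= 1e-9 * r[0])))
```

With $c < 2$ the sequence should be nonincreasing up to rounding; the sketch does not claim that monotonicity must fail for $c > 2$.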


Note that this implies that $\|r_t\| := \|y - X\beta_t\|$ is monotonically nonincreasing
for all $t \in (0, \bar\tau)$. The following lemma makes it precise.


Lemma A.5. For $t \in [0, \bar\tau]$, the residue admits an orthogonal decomposition

\[
\|y - X\beta_t\|^2 = \|X_S(\tilde\beta_S^* - \beta_S(t))\|^2 + \|(I - P_S)\epsilon\|^2
\]

and is monotonically nonincreasing.
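Proof. By the Pythagorean theorem,

\[
\begin{aligned}
\|r_t\|^2 &= \|X_S(\beta_S^* - \beta_S(t)) + \epsilon\|^2
  = \|P_S X_S(\beta_S^* - \beta_S(t)) + P_S\epsilon\|^2 + \|(I - P_S)\epsilon\|^2 \\
&= \|X_S(\beta_S^* - \beta_S(t)) + X_S(X_S^* X_S)^{-1} X_S^* \epsilon\|^2 + C_{\epsilon,S} \\
&= \|X_S(\tilde\beta_S^* - \beta_S(t))\|^2 + C_{\epsilon,S},
\end{aligned}
\]

where $C_{\epsilon,S} := \|(I - P_S)\epsilon\|^2$, and the conclusion follows from the fact that
$\|X_S(\tilde\beta_S^* - \beta_S(t))\|$ is monotonically nonincreasing. □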
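
The decomposition is a purely algebraic identity, so it can be checked for any vector supported on $S$, not only for points on the path. The sketch below assumes $X^* = X^T/n$, takes $P_S = X_S(X_S^* X_S)^{-1} X_S^*$ as the orthogonal projector onto the column space of $X_S$, and takes $\tilde\beta_S^* = \beta_S^* + (X_S^* X_S)^{-1} X_S^* \epsilon$ as in the proof. The function name `decomposition_sides`, the vector `b` (standing in for $\beta_S(t)$), and the noise level are hypothetical.

```python
import numpy as np

def decomposition_sides(n=40, s=4, seed=2):
    """Return both sides of the orthogonal decomposition of the residual."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, s))            # stands in for X_S (assumption)
    beta_star = rng.standard_normal(s)         # true signal restricted to S
    eps = 0.1 * rng.standard_normal(n)         # noise
    b = rng.standard_normal(s)                 # arbitrary point in place of beta_S(t)

    G = X.T @ X / n                            # X_S^* X_S, assuming X^* = X^T / n
    beta_oracle = beta_star + np.linalg.solve(G, X.T @ eps / n)
    P = X @ np.linalg.solve(G, X.T / n)        # projector onto col(X_S)

    lhs = np.sum((X @ (beta_star - b) + eps) ** 2)
    rhs = np.sum((X @ (beta_oracle - b)) ** 2) + np.sum((eps - P @ eps) ** 2)
    return lhs, rhs

lhs, rhs = decomposition_sides()
print("difference:", abs(lhs - rhs))
```

The two sides should agree up to floating-point error, which reflects that $P_S\epsilon$ and $(I - P_S)\epsilon$ are orthogonal.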